I received a textfile created with Notepad++ that I'm trying to read with a Fortran 95 program on both a Mac and a PC. The read line is:
read(lun,'(a)',iostat=io1) input
Since I don't know what the line lengths are I defined input to be 512 in length. With non-notepad++ files when the end of line is found the read "stops" and automatically advances to the next line of text. With the notepad++ file, it reads 512 characters, skipping over the carriage returns. When I open the file using the dos editor on the pc I see carriage return symbols (ASCII char 13) but there is no break between lines, they are all appended to one another.
I've tried searching for ichar(13) and ichar(10), backspacing to the beginning of the line and trying to force an advance to the next line; reading in with format '(a,/')', but haven't been able to get anything to work.
What you need is a pipeline type design. The basic routine is one called getline, which gets a line of data up to the carriage return. Inside the initialization, what you do is open the file as a binary file and read a buffer of say 1024 characters in. Whenever getline is called, return the next lot of characters until you get to a CR. If there aren't enough characters, move the unprocessed characters to the front and read in the remaining characters.
It is basically how compilers work - they get a stream of tokens, which, in your case is a string of characters ending with a CR, and then they process the tokens.
Related
If I have a text file where lines contains some non-blank characters followed by spaces, how do I read those lines into a character variable without excess spaces?
character (len=1000) :: text
open (unit=20,file="foo.txt",action="read")
read (20,"(a)") text
will read the first 1000 characters of a line into variable text, which will be padded with spaces at the end if there are fewer than 1000 characters in the line. But if the line length is 100 you have 900 extraneous spaces, and the program does not "know" how long the line read actually was.
Fortran strings are blank-padded. There is simply no chance to distinguish any significant blank-padding in your strings with constant-length Fortran strings.
If every whitespace character is important, I suggest to treat the file as a stream-access file instead (formated or unformatted as needed), read individual characters to some array buffer and allocate a deferred-length string only after you know the length you actually need.
character (len=1000) :: text
integer :: s, ios
open (unit=20,file="foo.txt",action="read")
read (20,"(a)", size=s, advance='no', iostat=ios) text
After that last line, s contains the number of characters read, including trailing spaces, which I think is what you wanted.
Notes:
With a size tag, you must also have an advance tag set to 'no' otherwise you get a compilation error. Since the format is "(a)", the whole line is read so the next read statement will advance to the next line despite the 'no'. That's fine.
ios stores a negative integer when attempting to read past the end of the line. This will always happen if the line is shorter than length of text. That's fine.
When attempting to read past the end of the file, ios will store a different negative integer. What those two negative integers are is not set by the standard I think so you may have to experiment a bit. In my case, with the gfortran compiler, ios was -1 when attempting to read past the end of the file and -2 otherwise.
I have some code to read a text based file format in that it checks for empty line with:
line == ""
where line is a string that receives a text line obtained through getline.
It worked with my own text based file format, but it did not work with another text based file format (not mine)
I opened the file with gedit and saw nothing. More and less utilities also did not show anything. Then I tried vi and it showed:
^M on all these lines that seemed empty until now (a screenshot of it is here: .
Did some research and it seems that opening the file in text mode, all I needed to do was to compare it to '\n'. So I wrote the line:
if (line[0] == '^M' || line[0] == '\n')
break;
to end a while loop where this "if" is inside, but it did not work. What do I need to do?
As you have already surmised, those ^Ms are vi's way of showing you that there are carriage return characters at the end of each line. The file probably originated on Windows.
As other commentators have mentioned, the way a carriage return character is represented in C / C++ is '\r', and the line endings in that particular file will almost certainly actually be \r\n (CRLF).
So, now you know how it all works you have some code to write. getline will remove the \n but you'll have to strip the \r (if there is one) off the end of the line yourself. Go to it.
I load the text file .txt using the LoadFromFile() function, and the text in the middle of the line is marked with a newline '\n'.
The LoadFromFile() function treats this character as a new line and divides the line in that place by creating a new line.
In the Windows system Note the text looks like this: **Ala has ace**
The program that loads this file looks different:
plik->LoadFromFile( path, TEncoding::ASCII);
for( short int i = 0; i < plik->Count; ++i )
Memo1->Lines->Add( plik->Strings[i] );
In Memo1 the text looks like this:
**Ala**
**has ace**
Can I remove the '\n' character to make the entire line and how?
I answered this same question on the Embarcadero forums earlier today, but I will answer it here, too.
plik is a TStringList (according to the other discussion), so its LoadFrom...() method treats bare-CR, bare-LF, and CRLF line breaks equally when the TStrings::LineBreak property matches the RTL's global sLineBreak constant. If the LineBreak property does not match sLineBreak, then TStrings only splits on line breaks that match its LineBreak property.
Since the RTL's sLineBreak constant is CRLF on Windows, and you don't
want to split on bare-LF line breaks, you are going to have to parse
the file data manually, not use TStrings::LoadFromFile() at all.
For instance, you could read the whole file into a System::String using the System::Classes::TStreamReader::ReadToEnd() or System::Ioutils::TFile::ReadAllText() method (TStreamReader and TFile both have methods for reading lines, but they both treat all three forms of line break equally), and then parse that String to extract CRLF-delimited substrings while ignoring any bare-LF characters.
Ideally, you would load a file into a TMemo by using its own LoadFromFile() method. But, in this situation, that will not work, either, because TMemo normalizes all three forms of line breaks to CRLF before passing the data to the Win32 API, so that is not useful to you.
Now I am quite confused about the end of line character I am working with c++ and I know that text files have a end of line marker which sets the limit for reading a line which a single shifing operator(>>).Data is read continously untill eol character does not apprears and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if i add white spaces in my text then would it act as eol maker cause it does.
Now i created a normal file i.e. a file without .txt
eg
ifstream("test"); // No .txt
Now what is eol marker in this case
The ".txt" at the end of the filename is just a convention. It's just part of the filename.
It does not signify any magical property of the file, and it certainly doesn't change how the file is handled by your operating system kernel or file system driver.
So, in short, what difference is there? None.
I know that text files have a end of line marker which sets the limit for reading a line which a single shifing operator(>>)
That is incorrect.
Data is read continously untill eol character does not apprears
Also incorrect. Some operating systems (e.g. Windows IIRC) inject an EOF (not EOL!) character into the stream to signify to calling applications that there is no more data to read. Other operating systems don't even do that. But in neither case is there an actual EOF character at the end of the actual file.
while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker
That conversion may or may not happen and, either way, EOL is not EOF.
if i add white spaces in my text then would it act as eol maker cause it does.
That's a negative, star command.
I'm not sure where you're getting all this stuff from, but you've been heavily mistaught. I suggest a good, peer-reviewed, well-recommended book from Amazon about how computer operating systems work.
When reading strings in C++ using the extraction operator >>, the default is to skip spaces.
If you want the entire line verbatim, use std::getline.
A typical input loop is:
int main(void)
{
std::string text_from_file;
std::ifstream input_file("My_data.txt");
if (!input_file)
{
cerr << "Error opening My_data.txt for reading.\n";
return EXIT_FAILURE;
}
while (input_file >> text_from_file)
{
// Process the variable text_from_file.
}
return EXIT_SUCCESS;
}
A lot of old and mainframe operating systems required a record structure of all data files which, for text files, originated with a Hollerith (punch) card of 80 columns and was faithfully preserved through disk file records, magnetic tapes, output punch card decks, and line printer lines. No line ending was used because the record structure required that every record have 80 columns (and were typically filled with spaces). In later years (1960s+), having variable length records with an 80 column maximum became popular. Today, even OpenVMS still requires the file creator to specify a file format (sequential, indexed, or "stream") and record size (fixed, variable) where the maximum record size must be specified in advance.
In the modern era of computing (which effectively began with Unix) it is widely considered a bad idea to force a structure on data files. Any programmer is free to do that to themselves and there are plenty of record-oriented data formats like compiler/linker object files (.obj, .so, .o, .lib, .exe, etc.), and most media formats (.gif, .tiff, .flv, .mov, mp3, etc.)
For communicating text lines, the paradigm is to target a terminal or printer and for that, line endings should be indicated. Most operating systems environments (except MSDOS and Windows) use the \n character which is encoded in ASCII as a linefeed (ASCII 10) code. MSDOS and ilk use '\r\n' which are encoded as carriage return then linefeed (ASCII 13, 10). There are advantages and disadvantages to both schemes. But text files may also contain other controls, most commonly the ANSI escape sequences which control devices in specific ways:
clear the screen, either in part or all of it
eject a printer page, skip some lines, reverse feed, and other little-used features
establish a scrolling region
change the text color
selecting a font, text weight, page size, etc.
For these operations, line endings are not a concern.
Also, data files encoded in ASCII such as JSON and XML (especially HTML with embedded Javascript), might not have any line endings, especially when the data is obfuscated or compressed.
To answer your questions:
I am quite confused about the end of line character I am working with c++ and I know that text files have a end of line marker
Maybe. Maybe not. From a C or C++ program's viewpoint, writing \n indicates to the runtime environment the end of a line. What the system does with that varies by runtime operating environment. For Unix and Linux, no translation occurs (though writing to a terminal-like device converts to \r\n). In MSDOS, '\n' is translated to \r\n. In OpenVMS, '\n' is removed and that record's size is set. Reading does the inverse translation.
which sets the limit for reading a line which a single shifing operator(>>).
There is no such limit: A program can choose to read data byte-by-byte if it wants as well as ignore the line boundaries.
The "shifting operators" are overloaded for filestreams to input or output data but are not related to bit twiddling shifts. These operators were chosen for visual approximation of input/output and due to their low operator precedence.
Data is read continously untill eol character does not apprears
This bit is confusing: I think you meant until eol character appears, which is indeed how the line-oriented functions gets() and fgets() work.
and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if i add white spaces in my text then would it act as eol maker cause it does.
Opening the file does not convert anything, but reading from a file might. However, no environment (that I know of) converts input to CR LF. MSDOS converts CR LF on input to \n.
Adding spaces has no effect on end of lines, end of file, or anything. Spaces are just data. However, the C++ streaming operations reading/writing numbers and some other datatypes use whitespace (a sequence of spaces, horizontal tabs, vertical tabs, form feed, and maybe some others) as a delimiter. This convenience feature may cause some confusion.
Now i created a normal file i.e. a file without .txt eg
ifstream("test"); \No .txt
Now what is eol marker in this case
The filename does not determine the file type. In fact, file.txt may not be a text file at all. Using a particular file extension is convenient for humans to communicate a file's purpose, but it is not obligatory.
I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know if it comes from Mac, Windows or Unix, I'm not sure if QString.split("\n") would work well in all the cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
This splits the string whenever any of the newline character (either line feed or carriage return) is found. Any consecutive line breaks (e.g. \r\n\r\n or \n\n) will be considered multiple delimiters with empty parts between them, which will be skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
I'm not familiar with QString.split, but I suspect that this:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
QString.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
Another point: you're probably not likely to encounter text files that use just '\r' to delimit lines. That's the format used by MacOS up to version 9. MaxOS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).