C++: istream counts every newline in a .txt file as two

I've got a slight problem. It appears that my function, when counting the size of a .txt file, counts each newline as if it were two chars instead of one. Here's the function:
#define IN_FILE "in_mat.txt"
#define IN_BUF
#ifdef IN_BUF
void inBuf(char *&b)
{
    streampos size;
    ifstream f(IN_FILE, ios::in);  // opened in text mode
    f.seekg(0, ios::end);
    size = f.tellg();              // size reported by seeking to the end
    b = new char[size];
    f.seekg(0, ios::beg);
    f.read(b, size);
    f.close();
}
#endif
#endif
And here's the file being read:
2 2
1 0
0 1
2 2
i 0
0 -i
2 2
0 1
-1 0
2 2
0 i
i 0
Earlier, I put in some couts, and it appears that size = 60 while the actual size is 49 (I checked), and the number of newlines in the file is 11, which is exactly 60 - 49. Could somebody help me with that?

To add to the other answers, if you want to read special characters such as newline characters, you should open your file in binary mode, not text mode.
ifstream f(IN_FILE, ios::in | ios::binary);
If you don't open the file in binary mode, the platform's end-of-line sequence (for example "\r\n" on Windows) is translated by the runtime into a single character, '\n'. So in text mode you don't see the "real" contents of the file, byte for byte.
In addition, functions such as seekg() and tellg() will not behave as you might expect on a file opened in text mode, or at the very least will give you "wrong" results (not wrong from the functions' point of view, but wrong if your program tries to home in on a byte position within the file). Again, the newline (and EOF) translation done under the hood by the runtime gets in the way of these functions working as you would expect.
On the other hand, a file opened in binary mode allows these functions to work as expected: no translation of newlines or EOF. Whatever the individual bytes that make up the file contents are, that is what you get.
The next thing you need to determine is whether it is a Unix text file or a Windows text file. Depending on which one it is, the line endings will be different.
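A quick sketch of the fix for the question's function (same file, but opened in binary mode, so tellg() and read() finally agree on the size):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream f("in_mat.txt", std::ios::in | std::ios::binary);
    f.seekg(0, std::ios::end);
    std::streamsize size = f.tellg();  // raw byte count; each CRLF counts as 2
    char *b = new char[size];
    f.seekg(0, std::ios::beg);
    f.read(b, size);
    // In binary mode both numbers are 60; in text mode gcount() would be 49.
    std::cout << "asked for " << size << ", got " << f.gcount() << "\n";
    delete[] b;
    return 0;
}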

Windows uses "\r\n" to return to the beginning of the line ('\r') and begin a new one ('\n').
To remove them from your count, read the whole file (in binary mode) and count the number of '\r' characters.

Windows stores newlines as two characters: '\r\n', known as carriage return and line feed. That's why it's counted twice: there are actually two characters to be counted.

I am assuming that you are running on Windows. If not, disregard my answer below.
Windows stores new line characters in text files as two characters (CR LF or '\r' '\n'). So, seeking to the end of the file and calling tellg() will return the binary size of the file (60), not the text size (49).
In order to get the correct text size (49), one solution would be to count each new line character (11) and subtract that number from the total byte size.
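A sketch of that approach (assuming the file really does use Windows "\r\n" endings): read the whole file in binary mode, count the '\r' bytes, and subtract.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::ifstream f("in_mat.txt", std::ios::in | std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(f)),
                      std::istreambuf_iterator<char>());
    std::ptrdiff_t crs = std::count(bytes.begin(), bytes.end(), '\r');
    // binary size minus one byte per CRLF pair = text size: 60 - 11 = 49
    std::cout << bytes.size() - crs << "\n";
    return 0;
}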

Related

Number getting stored as special character in C++

#include <fstream>
#include <string.h>
#include <iostream>
using namespace std;

class contact
{
    long long ph;
    unsigned char name[20], add[50], email[30];
public:
    void create_contact()
    {
        cout << "Phone: ";
        cin >> ph;
        cout << "Name: ";
        cin.ignore();
        cin >> name;
        cout << "Address: ";
        cin.ignore();
        cin >> add;
        cout << "Email address: ";
        cin.ignore();
        cin >> email;
        cout << "\n";
    }
    void show_contact()
    {
        cout << endl << "Phone Number: " << ph;
        cout << endl << "Name: " << name;
        cout << endl << "Address: " << add;
        cout << endl << "Email Address : " << email;
    }
    long long getPhone() { return ph; }
    unsigned char* getName() { return name; }
    unsigned char* getAddress() { return add; }
    unsigned char* getEmail() { return email; }
};

fstream fp;
contact cont;

void save_contact()
{
    fp.open("contactBook.txt", ios::out | ios::app);
    cont.create_contact();
    fp.write((char*)&cont, sizeof(contact));
    fp.close();
    cout << endl << endl << "Contact Has Been Successfully Created...";
    getchar();
}
Hey there, I am new to C++ as well as this community, and this is the code I have been working on. The phone number of the contact is getting saved as random special characters. This is the half of the code where I think the problem occurs. Any ideas on how I could fix it? It would be of much help. Thanks!
I take it you expected to see the phone number written out in your text file as something like "15551234567." However, long long is not stored in this form in memory. It's actually stored as a 64-bit binary integer. The special characters you describe are likely the encoded version of that integer. If you read the data back in, you should find that it is still an integer.
However, there is one remaining issue. You are missing ios::binary on the fstream open command. Each of the ios flags imbues the stream with a particular behavior:
ios::out - indicates that this stream should be an output stream that you can write bytes to
ios::app - indicates that this stream should be opened in "append" mode. This means that it will not erase the contents of the file every time you open it, and any bytes outputted to the stream are appended to the end of the file.
ios::binary - opens the file in binary mode, which is needed when you want to input/output binary data, rather than just text.
You want to open the file with ios::out | ios::app | ios::binary. Forgetting binary is going to lead to very difficult to debug errors.
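Applied to the code above, the fix is one line (the rest of save_contact stays the same):

fp.open("contactBook.txt", ios::out | ios::app | ios::binary);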
Now binary mode is a bit of a pest. Sorry for this being a long read, but it's a lot easier to come to grips with this flag if you understand the history behind it.
Way back in the early days of computing, there was a disagreement about how to write a newline into a file. This dates back to the days of typewriters, where starting a new line was broken into two actions. There was a "carriage return" which moved the sliding bit of the typewriter back to the start of the line (this was the loud part of the motion), and there was a "line feed" which moved the paper up one spot. Each of these was a separate action, so they were given separate characters in ASCII, one of the definitive ways to write text as a string of bytes. The 8-bit number 10 encoded a line feed (aka LF), and the 8-bit number 13 encoded a carriage return (aka CR). This would permit one to do things like overtyping, a trick where one types one character (like a letter) and then goes back to add another over the top (like an accent). You might write à by first typing a, doing a "carriage return" and then writing a `, just like you did on a typewriter.
Some operating systems (such as Windows) encoded the start of the next line as both of these characters, so you'd see CR LF in a text file. Other operating systems (such as Unix) decided that it wasn't worth wasting a precious byte at the end of every line, so they chose to represent the start of a new line just with a LF. Others (such as Macintosh), decided to represent the new line as CR. Nobody could agree.
To deal with this, many file reading/writing APIs treat these characters specially. fopen and fstream follow a pattern where if they see a CR LF or a CR in a text file, they silently turn it into a LF character when read. This lets you read every file type. Likewise, if it sees a LF character when writing, it expands it to whatever the platform specified a new line should look like. This lets you write cross-platform code which writes text files without having to pay attention to which new line character is used on each platform!
However, this causes huge problems for binary data. Consider the number 302,844,416 written as a 32 bit number. In hexadecimal, we would write that as 0x120D0A00 (hex is a popular way to write numbers in programming because every byte can be written as 2 characters in hex). The issue is the middle two bytes of the number, 0x0D and 0x0A. In decimal, these are 13 and 10, which you should recognize as the same bytes as CR and LF.
If the program tries to read that number in "text mode," it will see the CR LF pair and turn it into just a single LF, per the C rules. Now, instead of our number being 0x120D0A00, it's 0x120A00XX, where XX is whatever the next byte was in the file. Very bad things! Not only is this data corrupted, but you probably needed the next byte for whatever came next in the file!
ios::binary and the "b" flag for fopen resolve this. They tell C/C++ that the data is going to be binary. There won't be any newlines to convert. If you write bytes to a binary stream, they get written directly to the file, without any clever attempts to handle newlines.
Your phone number is stored as a long long, which is a binary integer format. Without ios::binary, you run the risk of the number just happening to have a CR LF pair in it, and fstream will corrupt your data. ios::binary tells fstream to not mess with the data in that way.
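As a rough sketch of reading the record back (load_contact is a hypothetical helper, not part of the original code; it reuses the fp and cont globals from the question):

void load_contact()
{
    // ios::binary again, so the raw bytes of the struct are read back unchanged
    fp.open("contactBook.txt", ios::in | ios::binary);
    fp.read((char*)&cont, sizeof(contact));
    fp.close();
    cont.show_contact();  // the phone number prints as a normal integer again
}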

C++ std::ifstream std::ios::in versus std::ios::in | std::ios::binary, why do I get junk data?

I'm curious why, when I call read on a std::ifstream object, I get junk data if I open the file with std::ios::in, whereas I don't get junk data with std::ios::in | std::ios::binary.
I included screenshots of some messy code I've been trying stuff out with. I'm just confused why I get junk data with the first picture, when the second picture produces the correct data with the std::ios::binary flag set.
Junk data, but correct file length:
No junk data, same file length:
In text mode, certain characters may be transformed.
On cppreference, it says this about binary vs. text mode for files:
Data read in from a text stream is guaranteed to compare equal to the data that were earlier written out to that stream only if all of the following is true:
the data consist only of printing characters and the control characters \t and \n (in particular, on Windows, the character '\x1A' terminates input)
no \n is immediately preceded by a space character (space characters that are written out immediately before a \n may disappear when read)
the last character is \n
At a guess I would say that some of the characters in your input file do not obey these rules.
In text mode, file positions aren't the number of bytes you can read.
So when you seek to the end, and see the file position is 24, that doesn't mean you can read 24 bytes. In fact you only read 20 bytes, but your loop ran 24 times, printing the 20 read bytes and another 4 garbage bytes from whatever was already in memory.
To find out the actual number of bytes that were read, you can call file_data.gcount() after calling file_data.read.
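A sketch of that (file and variable names are illustrative, since the question's code is only shown in screenshots):

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream file_data("file.txt", std::ios::in);  // text mode
    file_data.seekg(0, std::ios::end);
    std::streamsize size = file_data.tellg();           // e.g. 24
    file_data.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    file_data.read(buffer.data(), size);
    std::streamsize got = file_data.gcount();           // e.g. only 20 bytes read
    std::cout.write(buffer.data(), got);                // print only the valid bytes
    return 0;
}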

Checking line breaks (newline characters) with CFile

I need to check whether a file ends with a line break or not, using CFile.
What I have tried:
point the file pointer at the end of the file
move the pointer back by 2 units
check if the pointer is pointing at \r\n
Here is my code:
cfile.SeekToEnd();
cfile.Seek(-2, CFile::current);
char buffer[2];
cfile.Read(buffer, 2);
if(buffer[0] == '\r' && buffer[1] == '\n') printf("Ended with line break!");
else printf("Not ended with line break!");
However, what I found is that the buffer gives me a '\n' character and a garbage character (with a weird value like 204). After some research through the documentation, I found that CFile::Read only counts \r\n as a single character:
For text-mode files, carriage return–linefeed pairs are counted as single characters.
I am so confused because the file pointer obviously still counts 2 characters but I cannot get both of them. Is there any method to check line break at the end of a file with CFile?
I solved the problem by using CFile::typeBinary when opening the file.
It seems that CFile::typeText does some special processing of carriage return–linefeed pairs (e.g. leaving a garbage byte where the carriage return was supposed to be), which is not desired in my case.
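A sketch of the working version (MFC code; the file name and the lack of error handling are placeholders):

CFile cfile;
// typeBinary disables the CR/LF processing that produced the garbage byte above
cfile.Open(_T("test.txt"), CFile::modeRead | CFile::typeBinary);
cfile.Seek(-2, CFile::end);  // position at the last two bytes of the file
char buffer[2];
cfile.Read(buffer, 2);
if (buffer[0] == '\r' && buffer[1] == '\n') printf("Ended with line break!");
else printf("Not ended with line break!");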

Accessing information in a ".txt" file and jumping to a given row

When accessing a text file, I want to read from a specific line. Let's suppose that my file has 1000 rows and I want to read row 330. Each row has a different number of characters and could possibly be quite long (let's say around 100,000,000 characters per row). I'm thinking fseek() can't be used effectively here.
I was thinking about a loop to track line breaks, but I don't know exactly how to implement it, and I don't know if that would be the best solution.
Can you offer any help?
Unless you have some kind of index saying "line M begins at position N" in the file, you have to read characters from the file and count newlines until you find the desired line.
You can easily read lines using std::getline if you want to save the contents of each line, or std::istream::ignore if you want to discard the contents of the lines you read until you find the desired line.
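For example, a sketch that skips to line 330 (the file name is illustrative):

#include <fstream>
#include <iostream>
#include <limits>
#include <string>

int main()
{
    std::ifstream f("data.txt");
    // discard the first 329 lines; ignore() skips up to and including '\n'
    for (int i = 0; i < 329 && f; ++i)
        f.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    std::string line;
    if (std::getline(f, line))
        std::cout << line << "\n";  // this is line 330
    return 0;
}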
There is no way to know where row 330 starts in an arbitrary text file without scanning the whole file, finding the line breaks, and then counting.
If you only need to do this once, then scan. If you need to do it many times, then you can scan once and build up a data structure listing where all of the lines start. Now you can figure out where to seek to read just that line. If you're still just thinking about how to organize the data, I would suggest using some other type of data structure for random access. I can't recommend which one without knowing the actual problem that you are trying to solve.
Create an index on the file. You can do this "lazily", but as you read each buffer-full you may as well scan it for newlines.
If it is a Windows text file that uses a two-byte line ending ("\r\n"), the number of characters you have read up to a newline will not equal the file offset. So you should record the stream position with tellg() after each call to getline(),
something like:

std::vector<std::streampos> lineNumbers;
std::string line;
lineNumbers.push_back(0);               // first line begins at offset 0
while (std::getline(ifs, line))
{
    lineNumbers.push_back(ifs.tellg()); // position just after each newline
}

The last value will tell you where EOF is.
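Later, to jump straight to a given line (a sketch; n is whatever 0-based line index you want):

ifs.clear();                // clear the EOF state left by the indexing loop
ifs.seekg(lineNumbers[n]);  // seek to the stored start-of-line position
std::getline(ifs, line);    // 'line' now holds line n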
I think you need to scan the file and count the \n occurrences until you find the desired line. If this is a frequent operation, and you are the only one who writes the file, you can maintain an index file containing such information side by side with the one containing the data, a sort of poor man's index; it can save a lot of time.
Try running fgets in a loop
/* fgets example */
#include <stdio.h>

int main()
{
    FILE *pFile;
    char mystring[100];

    pFile = fopen("myfile.txt", "r");
    if (pFile == NULL) perror("Error opening file");
    else {
        fgets(mystring, 100, pFile);
        puts(mystring);
        fclose(pFile);
    }
    return 0;
}
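The example above reads only one line, though; to reach a particular row you would call fgets in a loop, counting completed lines. A sketch (assuming a plain text file; very long rows arrive in several chunks, and only a chunk ending in '\n' finishes a line):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *pFile = fopen("myfile.txt", "r");
    char chunk[4096];
    int lines_passed = 0;
    if (pFile == NULL) { perror("Error opening file"); return 1; }
    /* skip the first 329 lines; a chunk ending in '\n' completes a line */
    while (lines_passed < 329 && fgets(chunk, sizeof chunk, pFile))
        if (chunk[strlen(chunk) - 1] == '\n') lines_passed++;
    if (fgets(chunk, sizeof chunk, pFile))
        puts(chunk); /* first chunk of row 330 */
    fclose(pFile);
    return 0;
}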

Difference between files written in binary and text mode

What translation occurs when writing to a file that was opened in text mode that does not occur in binary mode? Specifically in MS Visual C.
unsigned char buffer[256];
for (int i = 0; i < 256; i++) buffer[i]=i;
int size = 1;
int count = 256;
Binary mode:
FILE *fp_binary = fopen(filename, "wb");
fwrite(buffer, size, count, fp_binary);
Versus text mode:
FILE *fp_text = fopen(filename, "wt");
fwrite(buffer, size, count, fp_text);
I believe that most platforms will ignore the "t" option or the "text-mode" option when dealing with streams. On Windows, however, this is not the case. If you take a look at the description of the fopen() function at MSDN, you will see that specifying the "t" option will have the following effect:
line feeds ('\n') will be translated to "\r\n" sequences on output
carriage return/line feed sequences will be translated to line feeds on input.
If the file is opened in append mode, the end of the file will be examined for a Ctrl-Z character (character 26) and that character removed, if possible. It will also interpret the presence of that character as being the end of file. This is an unfortunate holdover from the days of CP/M (something about the sins of the parents being visited upon their children up to the 3rd or 4th generation). Contrary to previously stated opinion, the Ctrl-Z character will not be appended.
In text mode, a newline "\n" may be converted to a carriage return + newline "\r\n".
Usually you'll want to open in binary mode. Trying to read binary data in text mode won't work; it will be corrupted. You can read text fine in binary mode though; it just won't automatically translate "\r\n" to "\n".
See fopen
Additionally, when you fopen a file with "rt", the input is terminated on a Ctrl-Z character.
Another difference is when using fseek
If the stream is open in binary mode, the new position is exactly offset bytes measured from the beginning of the file if origin is SEEK_SET, from the current file position if origin is SEEK_CUR, and from the end of the file if origin is SEEK_END. Some binary streams may not support SEEK_END.
If the stream is open in text mode, the only supported values for offset are zero (which works with any origin) and a value returned by an earlier call to std::ftell on a stream associated with the same file (which only works with an origin of SEEK_SET).
Even though this question was already answered and clearly explained, I think it would be interesting to show the main issue (translation between \n and \r\n) with a simple code example. Note that I'm not addressing the issue of the Ctrl-Z character at the end of the file.
#include <stdio.h>
#include <string.h>

int main() {
    FILE *f;
    char string[] = "A\nB";
    int len;

    len = strlen(string);
    printf("As you'd expect string has %d characters... ", len); /* prints 3 */
    f = fopen("test.txt", "w");  /* Text mode */
    fwrite(string, 1, len, f);   /* On Windows "A\r\nB" is written */
    printf("but %ld bytes were written to file", ftell(f)); /* prints 4 on Windows, 3 on Linux */
    fclose(f);
    return 0;
}
If you execute the program on Windows, you will see the following message printed:
As you'd expect string has 3 characters... but 4 bytes were written to file
Of course you can also open the file with a text editor like Notepad++ and see the characters yourself.
The inverse conversion is performed on Windows when reading the file in text mode.
We had an interesting problem with opening files in text mode where the files had a mixture of line ending characters:
1\n\r
2\n\r
3\n
4\n\r
5\n\r
Our requirement was to be able to store our current position in the file (we used fgetpos), close the file, and later reopen it and seek back to that position (we used fsetpos).
However, when a file had a mixture of line endings, this process failed to seek back to the same position. In our case (our tool parses C++), we were re-reading parts of the file we'd already seen.
Go with binary - then you can control exactly what is read and written from the file.
In "w" mode, the file is opened as a text stream, so newline translation may occur on output.
In "wb" mode, the file is opened as a binary stream; every byte is written exactly as given, with no newline or other translation, which is what you want for non-text data.