Reading a file into a string buffer and detecting EOF - c++

I am opening a file and placing its contents into a string buffer to do some lexical analysis on a per-character basis. Doing it this way lets parsing finish faster than issuing a series of fread() calls, and since the source file will never be larger than a couple of MB, I can rest assured that the entire contents of the file will always be read.
However, there seems to be some trouble in detecting when there is no more data to be parsed, because ftell() often gives me an integer value higher than the actual number of characters within the file. This wouldn't be a problem with the use of the EOF (-1) macro, if the trailing characters were always -1... But this is not always the case...
Here's how I am opening the file, and reading it into the string buffer:
FILE *fp = NULL;
errno_t err = _wfopen_s(&fp, m_sourceFile, L"rb, ccs=UNICODE");
if(fp == NULL || err != 0) return FALSE;
if(fseek(fp, 0, SEEK_END) != 0) {
fclose(fp);
fp = NULL;
return FALSE;
}
LONG fileSize = ftell(fp);
if(fileSize == -1L) {
fclose(fp);
fp = NULL;
return FALSE;
}
rewind(fp);
LPSTR s = new char[fileSize];
RtlZeroMemory(s, sizeof(char) * fileSize);
DWORD dwBytesRead = 0;
if(fread(s, sizeof(char), fileSize, fp) != fileSize) {
fclose(fp);
fp = NULL;
return FALSE;
}
This always appears to work perfectly fine. Following this is a simple loop, which checks the contents of the string buffer one character at a time, like so:
char c = 0;
LONG nPos = 0;
while(c != EOF && nPos <= fileSize)
{
c = s[nPos];
// do something with 'c' here...
nPos++;
}
The trailing bytes of the file are usually a series of ý (-3) and « (-85) characters, and therefore EOF is never detected. Instead, the loop simply continues onward until nPos ends up being of higher value than fileSize -- Which is not desirable for proper lexical analysis, because you often end up skipping the final token in a stream which omits a newline character at the end.
In a Basic Latin character set, would it be safe to assume that an EOF char is any character with a negative value? Or perhaps there is just a better way to go about this?
EDIT: I have just tried calling feof() in my loop, and all the same, it doesn't seem to detect EOF either.

Assembling comments into an answer...
You leak memory (potentially a lot of memory) when you fail to read.
You haven't allowed for a null terminator at the end of the string read.
There's no point in zeroing the memory when it is all about to be overwritten by the data from the file.
Your test loop is accessing memory out of bounds; nPos == fileSize is one beyond the end of the memory you allocated.
char c = 0;
LONG nPos = 0;
while(c != EOF && nPos <= fileSize)
{
c = s[nPos];
// do something with 'c' here...
nPos++;
}
There are other problems, not previously mentioned, with this. You did ask if it is 'safe to assume that an EOF char is any character with a negative value', to which I responded No. There are several issues here that affect both C and C++ code. The first is that plain char may be a signed type or an unsigned type. If the type is unsigned, then you can never store a negative value in it (or, more accurately, if you attempt to store a negative integer into an unsigned char, it will be truncated to the least significant 8* bits and will be treated as positive).
In the loop above, one of two problems can occur. If char is a signed type, then there is a character (ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS, 0xFF in the Latin-1 code set) that has the same value as EOF (which is always negative and usually -1). Thus, you might detect EOF prematurely. If char is an unsigned type, then there will never be any character equal to EOF. But the test for EOF on a character string is fundamentally flawed; EOF is a status indicator from I/O operations and not a character.
During I/O operations, you will only detect EOF when you've attempted to read data that isn't there. The fread() won't report EOF; you asked to read what was in the file. If you tried getc(fp) after the fread(), you'd get EOF unless the file had grown since you measured how long it is. Since _wfopen_s() is a non-standard function, it might be affecting how ftell() behaves and the value it reports. (But you later established that wasn't the case.)
Note that functions such as fgetc() or getchar() are defined to return characters as positive integers and EOF as a distinct negative value.
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int.
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF. (289)
(289) An end-of-file and a read error can be distinguished by use of the feof and ferror functions.
This indicates how EOF is separate from any valid character in the context of I/O operations.
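To make the distinction concrete, here is a minimal sketch (the helper name count_bytes is made up for illustration). It keeps the result of fgetc() in an int, so the byte 0xFF arrives as 255 and cannot be confused with EOF:

```cpp
#include <cstddef>
#include <cstdio>

// Counts the bytes remaining in an already-open stream.
// The result of fgetc() is stored in an int: a byte such as 0xFF is
// returned as 255, while EOF is a distinct negative value.
std::size_t count_bytes(std::FILE* fp)
{
    std::size_t n = 0;
    int c;  // int, not char
    while ((c = std::fgetc(fp)) != EOF) {
        ++n;
    }
    return n;
}
```

If c were declared as a signed char instead, a 0xFF byte in the data would compare equal to EOF and the loop would stop early.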
You comment:
As for any potential memory leakage... At this stage in my project, memory leaks are one of many problems with my code which, as of yet, are of no concern to me. Even if it didn't leak memory, it doesn't even work to begin with, so what's the point? Functionality comes first.
It is easier to head off memory leaks in error paths at the initial coding stage than to go back later and fix them — because you may not spot them because you may not trigger the error condition. However, the extent to which that matters depends on the intended audience for the program. If it is a one-off for a coding course, you may be fine. If you're the only person who'll use it, you may be fine. But if it will be installed by millions, you'll have problems retrofitting the checks everywhere.
I have swapped _wfopen_s() with fopen() and the result from ftell() is the same. However, after changing the corresponding lines to LPSTR s = new char[fileSize + 1], RtlZeroMemory(s, sizeof(char) * fileSize + 1); (which should also null-terminate it, btw), and adding if(nPos == fileSize) to the top of the loop, it now comes out cleanly.
OK. You could use just s[fileSize] = '\0'; to null terminate the data too, but using RtlZeroMemory() achieves the same effect (but would be slower if the file is many megabytes in size). But I'm glad the various comments and suggestions helped get you back on track.
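Pulling those points together, one possible shape for the read and the scan is sketched below. It is not the code from the question: the helper names read_whole and count_nonspace are invented, a std::vector<char> stands in for the new[] buffer so that error paths cannot leak, and the "do something with c" step is replaced by a trivial counter.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads an entire open binary stream into `out`.
// Returns false on failure; the vector frees its own memory either way.
bool read_whole(std::FILE* fp, std::vector<char>& out)
{
    if (std::fseek(fp, 0, SEEK_END) != 0) return false;
    const long size = std::ftell(fp);
    if (size < 0) return false;
    std::rewind(fp);

    out.resize(static_cast<std::size_t>(size));
    const std::size_t got = std::fread(out.data(), 1, out.size(), fp);
    out.resize(got);  // keep only the bytes that were actually read
    return std::ferror(fp) == 0;
}

// Visits each character exactly once. The index is the only end test;
// no character value is treated as an end marker.
std::size_t count_nonspace(const std::vector<char>& buf)
{
    std::size_t n = 0;
    for (std::size_t pos = 0; pos < buf.size(); ++pos) {  // <, not <=
        const char c = buf[pos];
        if (c != ' ' && c != '\n') ++n;
    }
    return n;
}
```

Because the loop stops on the index alone, a final token without a trailing newline is still seen in full, and bytes such as 0xFD or 0xFF in the data do not end the scan.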
* In theory, CHAR_BIT might be larger than 8; in practice it is almost always 8 and for simplicity, I'm assuming it is 8 bits here. The discussion has to be more nuanced if CHAR_BIT is 9 or more, but the net effect is much the same.

Related

fstream::read() read empty if input size too big

I have tried to read a file by using istream& read (char* s, streamsize n). I have read the description at: http://www.cplusplus.com/reference/istream/istream/read/ saying
If the input sequence runs out of characters to extract (i.e., the end-of-file is reached) before n characters have been successfully read, the array pointed to by s contains all the characters read until that point, and both the eofbit and failbit flags are set for the stream.
Because of that, I passed a very large number as n, trusting the caller to allocate a buffer large enough for the read. But I always get 0 bytes read. I tried the following code to read a 90-byte text file:
std::wstring name(L"C:\\Users\\dle\\Documents\\01_Project\\01_VirtualMachine\\99_SharedFolder\\lala.txt");
std::ifstream ifs;
ifs.open(name, ifstream::binary | ifstream::in);
if (ifs)
{
// get length of file:
ifs.seekg(0, ifs.end);
int length = ifs.tellg();
ifs.seekg(0, ifs.beg);
char *buffer = new char[length];
ifs.read(buffer, UINT32_MAX);
int success = ifs.gcount();
cout << "success: " << success << endl;
cout << "size: " << length;
ifs.close();
}
I even tried a smaller number, e.g. 500,000, and it still failed. I have realized that "n" and the file size are related somehow: "n" cannot be much larger than the file size, or else nothing is read.
I know we could fix that easily by passing the correct size to read(), but I wonder why it happens. It should read until EOF and then stop, right? Could anyone please explain why?
EDIT: I simply want to read to EOF using istream& read without caring about the file size. According to the definition of istream& read(char* s, streamsize n), it should work.
ifs.read(buffer, UINT32_MAX);
The second parameter to fstream::read is std::streamsize, which is defined as (emphasis mine)...
...a signed integral type...
I therefore guess (as I don't have a Windows environment to test on at this point) that you're working on a machine where std::streamsize is 32 bits wide, so your UINT32_MAX ends up as -1 (and that @john tested on a machine where sizeof( std::streamsize ) > 4, so that his UINT32_MAX doesn't wrap into the negative).
Try again with std::numeric_limits< std::streamsize >::max()... or even better yet, use length because, well, you have the file size right there and don't have to rely on the EOF behavior of fstream::read to save you.
I am not sure whether C++ changed the definition of streams from what the C standard says, but note that C's definition on binary streams states that they...
...may, however, have an implementation-defined number of null characters appended to the end of the stream.
So your, or the user's, assumption that a buffer big enough to hold the data written earlier is big enough to hold the data read till EOF might actually fail.

eof bit not set even if offset is beyond file size

I have a fstream pointer fileP_.
I open a file with:
fileP_.open(filePath_.c_str(), std::ios::in|std::ios::binary);
I have a Read() function, defined as:
int Read(size_t offset, char *buffer, size_t *size)
So here I read size bytes into buffer, starting at offset offset in the file.
My code of Read() is somewhat like this:
int rc = 0;
fileP_.seekg(offset);
fileP_.read(buffer, *size);
if (!fileP_.gcount()) {
if (fileP_.eof())
*size = rc;
else if (fileP_.fail())
rc = -EIO;
....
The code works fine as long as offset < filesize, but if I give offset > filesize, gcount() gives 0 (which is expected) and I get -EIO, whereas I expect size = rc = 0 in that case.
Am I missing anything in the above code?
Thanks!
If you seekg beyond the file size, the operation fails, the failbit is set, and read does not work (eofbit has not been set).
Once an operation fails, failbit is set and every following operation is a no-op until the state bits are cleared. In this case, if seekg fails, istream::read will not read anything and, in particular, will not set eofbit.
On the other hand, eofbit is not set merely because the position in the file is "at the end". It is set when the stream detects the end of file, that is, when it tries to get the next char and EOF is returned.
In general, in C++ it is not a good idea to control the end of input with eofbit. It is better to test whether the operation succeeded. When the operation fails, then test whether the cause is the end of file by checking eofbit.

String is not null terminated error

I'm getting a "string is not null terminated" error, though I'm not entirely sure why. The use of std::string in the second part of the code is one of my attempts to fix this problem, although it still doesn't work.
My initial code just used the buffer and copied everything into client_id[]. The error then occurred. If the error is correct, that means either client_id or theBuffer does not have a null terminator. I'm pretty sure client_id is fine, since I can see it in debug mode. The strange thing is that the buffer also has a null terminator. No idea what is wrong.
char * next_token1 = NULL;
char * theWholeMessage = &(inStream[3]);
theTarget = strtok_s(theWholeMessage, " ",&next_token1);
sendTalkPackets(next_token1, sizeof(next_token1) + 1, id_clientUse, (unsigned int)std::stoi(theTarget));
The body of sendTalkPackets is below. I'm getting the "string is not null terminated" error at the last line.
void ServerGame::sendTalkPackets(char * buffer, unsigned int buffersize, unsigned int theSender, unsigned int theReceiver)
{
std::string theMessage(buffer);
theMessage += "0";
const unsigned int packet_size = sizeof(Packet);
char packet_data[packet_size];
Packet packet;
packet.packet_type = TALK;
char client_id[MAX_MESSAGE_SIZE];
char theBuffer[MAX_MESSAGE_SIZE];
strcpy_s(theBuffer, theMessage.c_str());
//Quick hot fix for error "string not null terminated"
const char * test = theMessage.c_str();
sprintf_s(client_id, "User %s whispered: ", Usernames.find(theSender)->second.c_str());
printf("This is it %s ", buffer);
strcat_s(client_id, buffersize , theBuffer);
Methinks that problem lies in this line:
sendTalkPackets(next_token1, sizeof(next_token1) + 1, id_clientUse, (unsigned int)std::stoi(theTarget));
sizeof(next_token1)+1 will always give 5 (on a 32-bit platform) because it returns the size of the pointer, not the size of the char array.
One thing which could be causing this (or other problems): As
buffersize, you pass sizeof(next_token1) + 1. next_token1 is
a pointer, which will have a constant size of (typically) 4 or 8. You
almost certainly want strlen(next_token1) + 1. (Or maybe without the
+ 1; conventions for passing sizes like this generally only include
the '\0' if it is an output buffer.) There are a couple of other
places where you're using sizeof, which may have similar problems.
But it would probably be better to redo the whole logic to use
std::string everywhere, rather than all of these C routines. No
worries about buffer sizes and '\0' terminators. (For protocol
buffers, I've also found std::vector<char> or std::vector<unsigned char>
quite useful. This was before the memory in std::string was
guaranteed to be contiguous, but even today, it seems to correspond more
closely to the abstraction I'm dealing with.)
You can't just do
std::string theMessage(buffer);
theMessage += "0";
This fails on two fronts:
The std::string constructor doesn't know where buffer ends, if buffer is not 0-terminated. So theMessage will potentially be garbage and include random stuff until some zero byte was found in the memory beyond the buffer.
Appending string "0" to theMessage doesn't help. What you want is to put a zero byte somewhere, not value 0x30 (which is the ascii code for displaying a zero).
The right way to approach this is to put a literal zero byte buffersize slots beyond the start of the buffer. You can't do that in buffer itself, because buffer may not be large enough to accommodate that extra zero byte. A possibility is:
char *newbuffer = static_cast<char *>(malloc(buffersize + 1)); // C++ needs the cast from void*
strncpy(newbuffer, buffer, buffersize);
newbuffer[buffersize] = 0; // literal zero value
Or you can construct a std::string, whichever you prefer.
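For the std::string route, a minimal sketch follows; the helper names make_message and payload_length are made up for illustration. The two-argument constructor copies exactly the number of bytes it is told to, so the source does not need a terminator, and strlen (not sizeof on a pointer) gives the length of a terminated C string:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copies exactly `len` bytes from `data`; no terminator is needed in `data`.
// The resulting std::string keeps its own terminator for c_str().
std::string make_message(const char* data, std::size_t len)
{
    return std::string(data, len);
}

// Length of a null-terminated C string, as a caller should pass it.
// sizeof(text) would give the size of the pointer instead.
std::size_t payload_length(const char* text)
{
    return std::strlen(text);
}
```

With this, the call site could pass payload_length(next_token1) instead of sizeof(next_token1) + 1, assuming next_token1 points at a terminated string, as it does after strtok_s succeeds.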

Size error on read file

RESOLVED
I'm trying to make a simple file loader.
I aim to get the text from a shader file (plain text file) into a char* that I will compile later.
I've tried this function:
char* load_shader(char* pURL)
{
FILE *shaderFile;
char* pShader;
// File opening
fopen_s( &shaderFile, pURL, "r" );
if ( shaderFile == NULL )
return "FILE_ER";
// File size
fseek (shaderFile , 0 , SEEK_END);
int lSize = ftell (shaderFile);
rewind (shaderFile);
// Allocating size to store the content
pShader = (char*) malloc (sizeof(char) * lSize);
if (pShader == NULL)
{
fputs ("Memory error", stderr);
return "MEM_ER";
}
// copy the file into the buffer:
int result = fread (pShader, sizeof(char), lSize, shaderFile);
if (result != lSize)
{
// size of file 106/113
cout << "size of file " << result << "/" << lSize << endl;
fputs ("Reading error", stderr);
return "READ_ER";
}
// Terminate
fclose (shaderFile);
return 0;
}
But as you can see in the code I have a strange size difference at the end of the process which makes my function crash.
I must say I'm quite a beginner in C, so I might have missed some subtleties regarding memory allocation, types, pointers...
How can I solve this size issue?
EDIT 1:
First, I shouldn't return 0 at the end but pShader; that seemed to be what crashed the program.
Then, I changed the type of result to size_t and added a terminating character to pShader, adding pShader[result] = '\0'; after the declaration of result so I can display it correctly.
Finally, as @JamesKanze suggested, I replaced fopen_s with fopen, as the former was not useful in my case.
First, for this sort of raw access, you're probably better off
using the system level functions: CreateFile or open,
ReadFile or read and CloseHandle or close, with
GetFileSize or stat to get the size. Using FILE* or
std::filebuf will only introduce an additional level of
buffering and processing, for no gain in your case.
As to what you are seeing: there is no guarantee that an ftell
will return anything exploitable as a numeric value; it could
very well be just a magic cookie. On most current systems, it
is a byte offset into the physical file, but on any non-Unix
system, the offset into the physical file will not map directly
to the logical file you are reading unless you open the file in
binary mode. If you use "rb" to open the file, you'll
probably see the same values. (Theoretically, you could get
extra 0's at the end of the file, but practically, the OS's
where that happened are either extinct, or only used on legacy
mainframes.)
EDIT:
Since the answer stating this has been deleted: you should loop
on the fread until it returns 0 (setting errno to 0 before
each call, and checking it after the return to see whether the
function returned because of an error or because it reached the
end of file). Having said this: if you're on one of the usual
Windows or Unix systems, and the file is local to the machine,
and not too big, fread will read it all in one go. The
difference in size you are seeing (given the numerical values
you posted) is almost certainly due to the fact that the two
byte Windows line endings are being mapped to a single '\n'
character. To avoid this, you must open in binary mode;
alternatively, if you really are dealing with text (and want
this mapping), you can just ignore the extra bytes in your
buffer, setting the '\0' terminator after the last byte
actually read.
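A sketch of the loop-until-zero approach is below. It assumes the caller has opened the stream in binary mode, the helper name read_stream is invented, and it checks ferror() after the loop in place of the errno bookkeeping described above:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Reads until fread() returns 0, then reports whether that was
// end of file (*ok == true) or a read error (*ok == false).
std::string read_stream(std::FILE* fp, bool* ok)
{
    std::string out;
    char chunk[4096];
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof chunk, fp)) > 0) {
        out.append(chunk, got);  // only the bytes actually read
    }
    *ok = (std::ferror(fp) == 0);
    return out;
}
```

The result's size() is the number of bytes actually read rather than the value from ftell(), and c_str() is always terminated, so the 106 versus 113 mismatch cannot leave an unterminated tail.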

Using C++ to find out how many lines are in a text file

My C++ program needs to know how many lines are in a certain text file. I could do it with getline() and a while-loop, but is there a better way?
No.
Not unless your operating system's filesystem keeps track of the number of lines, which your system almost certainly doesn't as it's been a looong time since I've seen that.
By "another way", do you mean a faster way? No matter what, you'll need to read in the entire contents of the file. Reading in different-sized chunks shouldn't matter much since the OS or the underlying file libraries (or both) are buffering the file contents.
getline could be problematic if there are only a few lines in a very large file (high transient memory usage), so you might want to read in fixed-size 4KB chunks and process them one-by-one.
Iterate the file char-by-char with get(), and for each newline (\n) increment line number by one.
The fastest, but OS-dependent way would be to map the whole file to memory (if not possible to map the whole file at once - map it in chunks sequentially) and call std::count(mem_map_begin,mem_map_end,'\n')
I don't know whether getline() is the best choice: the amount read per call varies, and in the worst case (a run of \n characters) it reads a single byte in each iteration.
For me, it would be better to read the file in chunks of a predetermined size, and then scan each chunk for newline encodings.
There is one risk I don't know how to resolve, though: file encodings other than ASCII. If getline() handles them, then it's the easiest option, but I don't think it does.
Some URLs:
Why does wide file-stream in C++ narrow written data by default?
http://en.wikipedia.org/wiki/Newline
Possibly the fastest way is to use the low-level read() and scan the buffer for '\n':
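#include <cstdio>    // BUFSIZ
#include <cstring>   // memchr
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // read, close

// Returns the number of '\n' bytes in the file, or -1 if it cannot be opened.
int clines(const char* fname)
{
    int nfd, nLen;
    int count = 0;
    char buf[BUFSIZ+1];
    if((nfd = open(fname, O_RDONLY)) < 0) {
        return -1;
    }
    while( (nLen = read(nfd, buf, BUFSIZ)) > 0 )
    {
        char *p = buf;
        int n = nLen;
        // memchr returns void* here, so C++ needs the cast back to char*
        while( n && (p = static_cast<char*>(memchr(p,'\n', n))) ) {
            p++;
            n = nLen - (p - buf);
            count++;
        }
    }
    close(nfd);
    return count;
}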