Can open small ASCII file, but not large binary file? - C++

I am using the code below to open a large (5.1 GB) binary file in MSVC on Windows. The machine has plenty of RAM. The problem is that the length is retrieved as zero. However, when I change file_path to a smaller ASCII file, the code works fine.
Why can I not load the large binary file? I prefer this approach because I want a pointer to the file contents.
FILE * pFile;
uint64_t lSize;
char * buffer;
size_t result;

pFile = fopen(file_path, "rb");
if (pFile == NULL) {
    fputs("File error", stderr); exit(1);
}

// obtain file size:
fseek(pFile, 0, SEEK_END);
lSize = ftell(pFile); // RETURNS ZERO
rewind(pFile);

// allocate memory to contain the whole file:
buffer = (char*)malloc(sizeof(char)*lSize);
if (buffer == NULL) {
    fputs("Memory error", stderr); exit(2);
}

// copy the file into the buffer:
result = fread(buffer, 1, lSize, pFile); // RETURNS ZERO TOO
if (result != lSize) { // THIS FAILS
    fputs("Reading error", stderr); exit(3);
}

/* the whole file is now loaded in the memory buffer. */
It's not the file permissions or anything like that; they are fine.

If you allocate 5.1 GB, you'd better be sure that you've compiled your code as 64-bit and run it on a 64-bit Windows version. Otherwise, the memory address space is limited to at most 3 GB on 32-bit Windows and 4 GB for 32-bit code on 64-bit Windows.
By the way, ftell() returns a signed long. You have to check that there is no error here (such as an overflow if the OS allows larger file sizes), i.e. that the value is not -1.
Edit:
Note that with MSVC, long is currently a 32-bit number even when compiled for 64 bits. This means that ftell() will only give you a meaningful result if the file size is below 2 GB (because of the sign bit).
You could use the non-portable, OS-specific WinAPI function GetFileSizeEx() to get the size of large files as a signed 64-bit number.
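A minimal sketch of that call (real WinAPI functions, but error handling is only indicated; file_path as in the question):

// requires <windows.h>
HANDLE h = CreateFileA(file_path, GENERIC_READ, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (h == INVALID_HANDLE_VALUE) { /* handle the open error */ }
LARGE_INTEGER li;
if (!GetFileSizeEx(h, &li)) { /* handle the size query error */ }
uint64_t lSize = (uint64_t)li.QuadPart; // correct even above 2 GB
CloseHandle(h);
// you can still fopen()/fread() the same file afterwards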
malloc() takes a size_t, which is an unsigned 64-bit number on a 64-bit build. So on this side you're safe.
An alternative would be to use file mapping.
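A rough sketch of the file-mapping route (real WinAPI calls, but error handling is elided; a 5 GB view requires a 64-bit build):

// requires <windows.h>
HANDLE hFile = CreateFileA(file_path, GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE hMap  = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL); // 0/0 = map the whole file
const char *data = (const char*)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0); // view of the entire file
// ... use data[0 .. fileSize - 1] as if the whole file were in memory ...
UnmapViewOfFile(data);
CloseHandle(hMap);
CloseHandle(hFile);

This gives you the pointer to the file contents you wanted, without a 5 GB malloc(); the OS pages the data in on demand.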
Second edit
I looked at your edits about the value received for the size, which differ from what I expected. I could reproduce the error on my system, and got a size that was not zero, but a number much larger than the file.
Looking at this CERT security recommendation, it appears that the guarantees offered by the standard for fseek() in combination with SEEK_END are insufficient and make this a very unsafe approach.
So let's repeat: the safest way to get the size would be to use the native OS function, i.e. GetFileSizeEx() on Windows. There's a workaround on 64-bit Windows: use _fseeki64() and _ftelli64():
...
...
if (_fseeki64(pFile, 0, SEEK_END)) {
    fputs("File seek error", stderr);
    return (1);
}
lSize = _ftelli64(pFile); // RETURNS EXACT SIZE
...
This worked very well (the initial problem seemed to be linked to the return type, which was not large enough). However, keep in mind that it's a workaround, and I fear there could be other error conditions that lead to the vulnerability reported by CERT.

The data type long is too small to represent your file size. Use the stat() function (or the Windows-specific alternative GetFileAttributes) to read the file size.
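For example, with MSVC the 64-bit variant of stat could be used like this (a sketch; file_path as in the question):

// requires <sys/types.h> and <sys/stat.h>
struct __stat64 st;
if (_stat64(file_path, &st) == 0) {
    int64_t fileSize = st.st_size; // full 64-bit size
} else {
    // could not stat the file
}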

Related

AES decrypting files larger than available RAM

Up to this point, I used to decrypt files (located on a USB stick) with AES as follows:
FILE * fp = fopen(filePath, "r");
vector<char> encryptedChars;
if (fp == NULL) {
    //Could not open file
    continue;
}
while (true) {
    int nextEncryptedChar = fgetc(fp);
    if (nextEncryptedChar == EOF) {
        break;
    }
    encryptedChars.push_back(nextEncryptedChar);
}
fclose(fp);

char encryptedFileArray[encryptedChars.size()];
int encryptedByteCount = encryptedChars.size();
for (int x = 0; x < encryptedByteCount; x++) {
    encryptedFileArray[x] = encryptedChars[x];
}
encryptedChars.clear();

AES aes;
//Decrypt the message in-place
aes.setup(key, AES::KEY_128, AES::MODE_CBC, iv);
aes.decrypt(encryptedFileArray, sizeof(encryptedFileArray));
aes.clear();
This works perfectly for small files. At this point, I am opening a file from a USB stick, storing all characters into a vector, and copying the vector to an array. I know that &encryptedChars[0] can be used as an array pointer as well and will save some memory.
Now I want to decrypt a file of 256 KB (as opposed to 1 KB). Copying the data into a source array will require at least 256 KB of RAM. However, I only have 100 KB at my disposal and therefore cannot create a source array containing the encrypted data.
So I tried to use the FILE * that fopen gives me as the source pointer, and created a new file on the same USB stick as the destination pointer. I was hoping that the decryption rounds would use the memory of the USB stick as opposed to available memory on the heap.
FILE * fp = fopen(encryptedFilePath, "r");
FILE * fpDecrypt = fopen(decryptedFilePath, "w+");
if (fp == NULL || fpDecrypt == NULL) {
    //Could not open file!?
    return;
}

AES aes;
//Decrypt the message in-place
aes.setup(key, AES::KEY_128, AES::MODE_CBC, iv);
aes.decrypt((const char*)fp, fpDecrypt, firmwareSize);
aes.clear();
Unfortunately, the system locks up (no idea why).
Does anybody know if I can pass a FILE * to a function that expects a const char * as source and a void * as a destination?
I am using the following library: https://os.mbed.com/users/neilt6/code/AES/docs/tip/AES_8h_source.html
Thanks!
A lot of crypto libraries provide "incremental" APIs that allow a stream of data to be en/decrypted piece by piece, without having to load the stream into memory. Unfortunately, it appears that the library you're using doesn't (or, at least, does not explicitly document it).
However, if you know how CBC mode encryption works, it's possible to roll your own. Basically, all you need to do is take the last AES block (i.e. the last 16 bytes) of the previous chunk of ciphertext and use it as the IV when decrypting (or encrypting) the next chunk, something like this:
char buffer[1024]; // this needs to be a multiple of 16 bytes!
char ivTemp[16];

while (true) {
    int bytesRead = fread(buffer, 1, sizeof(buffer), inputFile);
    // save last 16 bytes of ciphertext as IV for next chunk
    if (bytesRead == sizeof(buffer)) memcpy(ivTemp, buffer + bytesRead - 16, 16);
    // decrypt the message in-place
    AES aes;
    aes.setup(key, AES::KEY_128, AES::MODE_CBC, iv);
    aes.decrypt(buffer, bytesRead);
    aes.clear();
    // write out decrypted data (todo: check for write errors!)
    fwrite(buffer, 1, bytesRead, outputFile);
    // use the saved last 16 bytes of ciphertext as IV for the next chunk
    if (bytesRead == sizeof(buffer)) memcpy(iv, ivTemp, 16);
    if (bytesRead < sizeof(buffer)) break; // end of file (or read error)
}
Note that this code will overwrite the iv array. That should be OK, though, since you should never use the same IV twice anyway. (In fact, with CBC mode, the IV should be chosen by the encryptor at random, using a cryptographically secure RNG, and sent alongside the message. The usual way to do that is to simply prepend the IV to the message file.)
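For illustration, if the IV were prepended to the ciphertext file, the decryptor could recover it like this (a sketch under that assumption; the asker's current file format may differ):

char iv[16];
FILE *inputFile = fopen(encryptedFilePath, "rb");
// assume the first 16 bytes of the file are the IV, the rest is ciphertext
if (inputFile == NULL || fread(iv, 1, sizeof(iv), inputFile) != sizeof(iv)) {
    // missing file or file too short to contain an IV: handle the error
}
// ... then decrypt the remaining bytes in chunks as shown above ...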
Also, the code above is somewhat less efficient than it needs to be, since it calls aes.setup() and thus re-runs the whole AES key expansion for each chunk. Unfortunately, I couldn't find any documented way to tell your crypto library to change the IV without re-running the setup.
However, looking at the implementation of your library, as linked by Sister Fister in the comments below, it looks like it's already replacing its internal copy of the IV with the last ciphertext block. Thus, it looks like all you really need to do is call aes.decrypt() for each block without a setup call in between, something like this:
char buffer[1024]; // this needs to be a multiple of 16 bytes!

AES aes;
aes.setup(key, AES::KEY_128, AES::MODE_CBC, iv);
while (true) {
    int bytesRead = fread(buffer, 1, sizeof(buffer), inputFile);
    // decrypt the chunk of data in-place (continuing from the previous chunk)
    aes.decrypt(buffer, bytesRead);
    // write out decrypted data (todo: check for write errors!)
    fwrite(buffer, 1, bytesRead, outputFile);
    if (bytesRead < sizeof(buffer)) break; // end of file (or read error)
}
aes.clear();
Note that this code is relying on a feature of the crypto library that does not seem to be explicitly documented, namely that calling aes.decrypt() multiple times will cause the decryptions to be chained correctly. (That's actually a pretty reasonable thing to do, for CBC mode, but you can never be sure without reading the code or finding explicit documentation saying so.) You should make sure to have a comprehensive test suite for this, and to re-run the tests whenever you upgrade the library.
Also note that I haven't tested either of these examples, so there obviously could be bugs or typos. Also, the docs for your crypto library are somewhat sparse, so it's possible that it might not work exactly the way I'm assuming it does. Please test anything based on this code thoroughly before using it!
In general, if something doesn't fit in memory, you can resort to:
- Random access to files. Use fseek to find the position and read or write what you need. The memory requirement is minimal.
- Processing in batches that fit into memory. The memory requirement is adjustable, but the algorithm must be suitable for this.
- System virtual memory, which allows you to reserve blocks as big as your system can address, provided you have free disk space and suitable system settings. This is usually transparent, depending on your system.
- Other paged memory mechanisms.
Since AES encryption works in blocks of 128 bits and you're short of memory, you should probably use random access on your file.
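A rough sketch of that random-access idea (blockIndex is a hypothetical counter; CBC chaining still has to be handled as in the other answer):

// read one 16-byte AES block at an arbitrary position without loading the file
unsigned char block[16];
long offset = (long)blockIndex * 16; // blockIndex: whichever ciphertext block is needed next
if (fseek(fp, offset, SEEK_SET) != 0 ||
    fread(block, 1, sizeof(block), fp) != sizeof(block)) {
    // seek failed, end of file, or I/O error
}
// decrypt `block` here, keeping the previous ciphertext block around as the CBC IV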

Incorrect size of file found using Visual Studio C++

I am porting C++ code from Linux to Windows. I am currently using Visual Studio 2013 to port my code.
I need to read a binary file and am using this portion of C++ code:
// Open the stream
std::ifstream is("myfile.bin");
// Determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size=is.tellg();
is.seekg(0, std::ios_base::begin);
// Create a vector to store the data
int* Data = new int[size/sizeof(int)];
// Load the data
is.read((char*) &Data[0], size);
// Close the file
is.close();
In Linux, the size of my binary file is correctly found to be 744 MB. However, in Windows, the size of my binary file is incorrectly found to be > 4 GB. How can I correct this issue?
Change std::ifstream is("myfile.bin"); to std::ifstream is("myfile.bin", std::ios::binary);
With your current default open mode, the file is opened in text mode. On Windows, text mode translates \r\n line endings to \n and treats a Ctrl-Z byte (0x1A) as end of file, so reads and seek offsets on a binary file no longer correspond to the bytes on disk. Opening in binary mode avoids these translations.
I finally had the time to actually run this myself, though I had to fix a couple of things, like ios_base::beg instead of begin (a different name). Also, as mentioned, the array allocation should be this: int* Data = new int[size / sizeof(int) + 1]; // At most one extra int
I found your problem: you're not in the right directory. Check if you successfully opened the file or not. If you don't, then you get a huge garbage value (probably -1, but unsigned, so massive) for size.
Try this to find your directory in Windows: (probably need Windows.h or something that I "just had" already)
char dirBuf[256];
GetCurrentDirectory(256, dirBuf);
cout << "Current directory is: " << dirBuf << endl;
See if that's where your file is and move it accordingly. Or specify the ENTIRE path in the constructor to ifstream.
Also, it has nothing to do with ios::binary or not. Works fine both ways, or fails if the file isn't there.
std::size_t size=is.tellg();
The standard doesn't require tellg to return the byte offset from the beginning of the file. In general, this may not be a reliable way to get the size of the file, though it probably does what you expect on Linux and Windows.
The return type of the tellg method is std::basic_stream::pos_type, so you're starting with an implicit conversion to std::size_t which may or may not be appropriate. In a 32-bit build, for example, it's conceivable that the size of a file could be larger than a std::size_t can represent.
But the root problem is that you're not checking for errors. If you have exceptions disabled, then tellg reports an error by returning pos_type(-1). When you cast that to an unsigned type (which std::size_t is), then you get a very large value. I suspect you failed to open the file, and since you didn't detect that error, the seekg and the tellg failed. You then coerced pos_type(-1) to a std::size_t, which made it look like the file was huge.
You also have the problems others have noted: failing to open the file in binary mode and computing the wrong size for the buffer when the file isn't a multiple of the size of an int.
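A minimal sketch of making those failures visible before trusting the size (not the asker's code, just an illustration of the checks described above):

// requires <fstream> and <iostream>
std::ifstream is("myfile.bin", std::ios::binary);
if (!is) {
    std::cerr << "could not open file\n";
    return 1;
}
is.seekg(0, std::ios_base::end);
std::ifstream::pos_type pos = is.tellg();   // pos_type, not size_t
if (pos == std::ifstream::pos_type(-1) || !is) {
    std::cerr << "could not determine file size\n";
    return 1;
}
std::size_t size = static_cast<std::size_t>(pos);
is.seekg(0, std::ios_base::beg);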
The most reliable way to get the file size is to use the OS's API. On Windows, you can do this instead:
// Open the file. [TODO: Get the file name in wide characters and use
// CreateFileW instead. If the file name contains characters not
// representable by the user's ANSI codepage, then CreateFileA will fail.]
HANDLE hfile = CreateFileA("myfile.bin", GENERIC_READ, FILE_SHARE_READ,
                           nullptr, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                           nullptr);
if (hfile == INVALID_HANDLE_VALUE) { /* error handling here */ }

// Figure out how big it is.
LARGE_INTEGER li_size;
if (!GetFileSizeEx(hfile, &li_size)) { /* error handling here */ }

// TODO: On a 32-bit build, this won't be able to handle huge files,
// so check that here.
std::size_t size = li_size.QuadPart;

// Create a buffer to store the data, being careful to round up to a
// multiple of sizeof(int). [TODO: Use a std::vector instead.]
int* Data = new int[(size + sizeof(int) - 1) / sizeof(int)];

// Load the data.
const DWORD BytesToRead = static_cast<DWORD>(size);
DWORD BytesRead = 0;
if (!ReadFile(hfile, Data, BytesToRead, &BytesRead, nullptr) ||
    BytesRead < BytesToRead) {
    /* error handling here */
}

// Close the file
CloseHandle(hfile);
int* Data = new int[size/sizeof(int)];
Why are you doing this? You're dividing the size by 4. You don't want to do this. It should just be int* Data = new int[size]
Also, it should be std::ifstream f("filename.bin", std::ios::binary);

Size error on read file

RESOLVED
I'm trying to make a simple file loader.
I aim to get the text from a shader file (plain text file) into a char* that I will compile later.
I've tried this function:
char* load_shader(char* pURL)
{
    FILE *shaderFile;
    char* pShader;

    // File opening
    fopen_s(&shaderFile, pURL, "r");
    if (shaderFile == NULL)
        return "FILE_ER";

    // File size
    fseek(shaderFile, 0, SEEK_END);
    int lSize = ftell(shaderFile);
    rewind(shaderFile);

    // Allocating size to store the content
    pShader = (char*) malloc(sizeof(char) * lSize);
    if (pShader == NULL)
    {
        fputs("Memory error", stderr);
        return "MEM_ER";
    }

    // copy the file into the buffer:
    int result = fread(pShader, sizeof(char), lSize, shaderFile);
    if (result != lSize)
    {
        // size of file 106/113
        cout << "size of file " << result << "/" << lSize << endl;
        fputs("Reading error", stderr);
        return "READ_ER";
    }

    // Terminate
    fclose(shaderFile);
    return 0;
}
But as you can see in the code, I get a strange size difference at the end of the process, which makes my function crash.
I must say I'm quite a beginner in C, so I might have missed some subtleties regarding memory allocation, types, pointers...
How can I solve this size issue?
EDIT 1:
First, I shouldn't return 0 at the end but pShader; that seemed to be what crashed the program.
Then, I changed the type of result to size_t, and added an end character to pShader, adding pShader[result] = '\0'; after its declaration, so I can display it correctly.
Finally, as @JamesKanze suggested, I turned fopen_s into fopen, as the former was not useful in my case.
First, for this sort of raw access, you're probably better off using the system-level functions: CreateFile or open, ReadFile or read, and CloseHandle or close, with GetFileSize or stat to get the size. Using FILE* or std::filebuf will only introduce an additional level of buffering and processing, for no gain in your case.
As to what you are seeing: there is no guarantee that an ftell will return anything exploitable as a numeric value; it could very well be just a magic cookie. On most current systems, it is a byte offset into the physical file, but on any non-Unix system, the offset into the physical file will not map directly to the logical file you are reading unless you open the file in binary mode. If you use "rb" to open the file, you'll probably see the same values. (Theoretically, you could get extra 0's at the end of the file, but practically, the OSes where that happened are either extinct or only used on legacy mainframes.)
EDIT:
Since the answer stating this has been deleted: you should loop on the fread until it returns 0 (setting errno to 0 before each call, and checking it after the return to see whether the function returned because of an error or because it reached the end of file). Having said this: if you're on one of the usual Windows or Unix systems, and the file is local to the machine, and not too big, fread will read it all in one go. The difference in size you are seeing (given the numerical values you posted) is almost certainly due to the fact that the two-byte Windows line endings are being mapped to a single '\n' character. To avoid this, you must open in binary mode; alternatively, if you really are dealing with text (and want this mapping), you can just ignore the extra bytes in your buffer, setting the '\0' terminator after the last byte actually read.
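A sketch of that read loop, reusing the names from the asker's function (the extra terminator byte assumes the buffer is allocated with lSize + 1):

// requires <cerrno>
size_t total = 0;
while (total < (size_t)lSize) {
    errno = 0;
    size_t n = fread(pShader + total, 1, (size_t)lSize - total, shaderFile);
    total += n;
    if (n == 0) {
        if (errno != 0) { /* genuine read error */ }
        break; // end of file: in text mode, total may legitimately be < lSize
    }
}
pShader[total] = '\0'; // needs malloc(lSize + 1) above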

Reading file with fread() in reverse order causes memory leak?

I have a program that basically does this:
- Opens some binary file
- Reads the file backwards (by backwards, I mean it starts near EOF and finishes reading at the beginning of the file, i.e. reads the file "right-to-left"), using 4 MB chunks
- Closes the file
My question is: why does memory consumption look like it does below, even though there are no obvious memory leaks in my attached code?
Here's the source of the program that was run to obtain the above image:
#include <stdio.h>
#include <string.h>

int main(void)
{
    //allocate stuff
    const int bufferSize = 4*1024*1024;
    FILE *fileHandle = fopen("./input.txt", "rb");
    if (!fileHandle)
    {
        fprintf(stderr, "No file for you\n");
        return 1;
    }
    unsigned char *buffer = new unsigned char[bufferSize];
    if (!buffer)
    {
        fprintf(stderr, "No buffer for you\n");
        return 1;
    }

    //get file size. file can be BIG, hence the fseeko() and ftello()
    //instead of fseek() and ftell().
    fseeko(fileHandle, 0, SEEK_END);
    off_t totalSize = ftello(fileHandle);
    fseeko(fileHandle, 0, SEEK_SET);

    //read the file... in reverse order. This is important.
    for (off_t pos = totalSize - bufferSize, j = 0;
         pos >= 0;
         pos -= bufferSize, j++)
    {
        if (j % 10 == 0)
        {
            fprintf(stderr,
                    "reading like crazy: %lld / %lld\n",
                    pos, totalSize);
        }
        /*
         * below is the heart of the problem. see notes below
         */
        //seek to desired position
        fseeko(fileHandle, pos, SEEK_SET);
        //read the chunk
        fread(buffer, sizeof(unsigned char), bufferSize, fileHandle);
    }
    fclose(fileHandle);
    delete [] buffer;
}
I have also made the following observations:
- Even though RAM usage jumps by 1 GB, the program itself uses only 5 MB throughout the whole execution.
- Commenting out the call to fread() makes the memory leak go away. This is weird, since I don't allocate anything anywhere near it that could trigger a memory leak...
- Also, reading the file normally instead of backwards (= commenting out the call to fseeko()) makes the memory leak go away as well. This is the ultra-weird part.
Further information...
The following doesn't help:
- Checking the results of fread() - yields nothing out of the ordinary.
- Switching to the normal, 32-bit fseek and ftell.
- Doing stuff like setbuf(fileHandle, NULL).
- Doing stuff like setvbuf(fileHandle, NULL, _IONBF, *any integer*).
Compiled with g++ 4.5.3 on Windows 7 via cygwin and mingw, without any optimizations, just g++ test.cpp -o test. Both present this behaviour.
The file used in the tests was 4 GB long, full of zeros.
The weird pause in the middle of the chart could be explained by some kind of temporary I/O hang-up, unrelated to this question.
Finally, if I wrap the reading in an infinite loop... the memory usage stops increasing after the first iteration.
I think it has to do with some kind of internal cache building up until it's filled with the whole file. How does it really work behind the scenes? How can I prevent that in a portable way?
I think this is more an OS issue (or even an OS resource-use reporting issue) than an issue with your program. Of course, it only uses 5 MB of memory: 1 MB for itself (libs, stack etc.) and 4 MB for the buffer. Whenever you do an fread(), the OS seems to "bind" part of the file to your process, and seems not to release it at the same speed. As memory use on your machine is low, this is not a problem: the OS just keeps the already-read data "hanging around" longer than necessary, probably assuming that your application might read it again soon, so it doesn't have to do that binding again.
If memory pressure were higher, then the OS would very likely unbind the memory faster, so the jump in your memory-usage history would be smaller.
I had the exact same problem, although in Java, but that doesn't matter in this context. I solved it by reading much bigger chunks at a time. I also read 4 MB chunks at first, but when I increased the chunk size to 100-200 MB the problem went away. Perhaps it'll do that for you as well. I'm on Windows 7.

Need some Help Reading File

I need to read a file (not in binary mode). I already have code to get the size of the file, and what I am searching for is how to read (the size of the file) - 8276 bytes from it. The bytes that have been read will be stored in a variable, and I will need to write them out.
The size of the file is stored in an unsigned long variable. Can anybody help me?
I use Borland C++.
Try this. It's been a while since I've touched Borland, so the syntax might be off a bit. Consider it pseudocode, but you get the concept.
// assuming you've already created the file handle.
HANDLE fileHandle;
unsigned long fileSize;
unsigned long numBytesRead;
bool result;

// get the file size
fileSize = GetFileSize(fileHandle, NULL);

// check to see if the file size is greater than 8276 bytes.
// if so, read (fileSize - 8276)
if (fileSize >= 8276)
{
    result = ReadFile(fileHandle, &objectYouAreReadingItTo,
                      (fileSize - 8276), &numBytesRead, NULL);
}
else
{
    //...handle when fileSize is less than 8276 bytes...
}