Check the file size without opening the file in C++?

I'm trying to get the file size of a large file (12 GB+) and I don't want to open the file to do so, as I assume this would eat a lot of resources. Is there any good API for this? I'm in a Windows environment.

You should call GetFileSizeEx which is easier to use than the older GetFileSize. You will need to open the file by calling CreateFile but that's a cheap operation. Your assumption that opening a file is expensive, even a 12GB file, is false.
You could use the following function to get the job done:
__int64 FileSize(const wchar_t* name)
{
    HANDLE hFile = CreateFile(name, GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return -1; // error condition, could call GetLastError to find out more

    LARGE_INTEGER size;
    if (!GetFileSizeEx(hFile, &size))
    {
        CloseHandle(hFile);
        return -1; // error condition, could call GetLastError to find out more
    }

    CloseHandle(hFile);
    return size.QuadPart;
}
There are other API calls that will return you the file size without forcing you to create a file handle, notably GetFileAttributesEx. However, it's perfectly plausible that this function will just open the file behind the scenes.
__int64 FileSize(const wchar_t* name)
{
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if (!GetFileAttributesEx(name, GetFileExInfoStandard, &fad))
        return -1; // error condition, could call GetLastError to find out more

    LARGE_INTEGER size;
    size.HighPart = fad.nFileSizeHigh;
    size.LowPart = fad.nFileSizeLow;
    return size.QuadPart;
}
If you are compiling with Visual Studio and want to avoid calling Win32 APIs then you can use _wstat64.
Here is a _wstat64 based version of the function:
__int64 FileSize(const wchar_t* name)
{
    __stat64 buf;
    if (_wstat64(name, &buf) != 0)
        return -1; // error, could use errno to find out more
    return buf.st_size;
}
If performance ever became an issue for you then you should time the various options on all the platforms that you target in order to reach a decision. Don't assume that the APIs that don't require you to call CreateFile will be faster. They might be but you won't know until you have timed it.

I've also worried about the price paid for opening a file and closing it just to get its size, so I decided to ask the performance counter and see how expensive the operations really are.
This is the number of cycles it took to execute one file-size query on the same file with the three methods. Tested on two files: 150 MB and 1.5 GB. I got +/- 10% fluctuations, so the numbers don't seem to be affected by the actual file size. (Obviously this depends on the CPU, but it gives you a good vantage point.)
190 cycles - CreateFile, GetFileSizeEx, CloseHandle
40 cycles - GetFileAttributesEx
150 cycles - FindFirstFile, FindClose
The gist with the code used is available here.
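The gist is not reproduced here, but a minimal sketch of the measurement idea (using __rdtsc from <intrin.h> for the cycle count; the path is a placeholder) might look like this:

#include <windows.h>
#include <intrin.h>   // __rdtsc
#include <stdio.h>

// Hypothetical harness: times a single GetFileAttributesEx size query in CPU
// cycles. The same pattern can wrap CreateFile/GetFileSizeEx/CloseHandle and
// FindFirstFile/FindClose for comparison.
int main()
{
    const wchar_t* name = L"C:\\Foo\\Bar.ext";   // placeholder path
    WIN32_FILE_ATTRIBUTE_DATA fad;

    unsigned __int64 start = __rdtsc();
    BOOL ok = GetFileAttributesExW(name, GetFileExInfoStandard, &fad);
    unsigned __int64 cycles = __rdtsc() - start;

    if (ok)
        printf("size query took %llu cycles\n", cycles);
    return 0;
}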
As we can see from this highly scientific :) test, the slowest is actually the file opener. Second slowest is the file finder, while the winner is the attributes reader. Now, in terms of reliability, CreateFile should be preferred over the other two. But I still don't like the concept of opening a file just to read its size... Unless I'm doing size-critical stuff, I'll go for the attributes.
PS: When I have time I'll try reading the sizes of files that are open and being written to. But not right now...

Another option using the FindFirstFile function
#include "stdafx.h"
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
int _tmain(int argc, _TCHAR* argv[])
{
WIN32_FIND_DATA FindFileData;
HANDLE hFind;
LPCTSTR lpFileName = L"C:\\Foo\\Bar.ext";
hFind = FindFirstFile(lpFileName , &FindFileData);
if (hFind == INVALID_HANDLE_VALUE)
{
printf ("File not found (%d)\n", GetLastError());
return -1;
}
else
{
ULONGLONG FileSize = FindFileData.nFileSizeHigh;
FileSize <<= sizeof( FindFileData.nFileSizeHigh ) * 8;
FileSize |= FindFileData.nFileSizeLow;
_tprintf (TEXT("file size is %u\n"), FileSize);
FindClose(hFind);
}
return 0;
}

As of C++17, there is std::filesystem::file_size as part of the standard library. (The implementation then gets to decide how to do it efficiently!)
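A minimal sketch of using it (the path below is just a placeholder):

#include <cstdint>
#include <filesystem>
#include <iostream>

int main()
{
    std::error_code ec;
    // The implementation decides how to obtain the size; we never open the file ourselves.
    std::uintmax_t size = std::filesystem::file_size(L"C:\\Foo\\Bar.ext", ec);
    if (ec)
        std::cerr << "error: " << ec.message() << '\n';
    else
        std::cout << "file size is " << size << " bytes\n";
    return 0;
}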

What about the GetFileSize function?
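For completeness, a sketch of how that would look; GetFileSize returns the low 32 bits and passes the high 32 bits back through a pointer, which is why GetFileSizeEx is usually more convenient (FileSizeOld is just an illustrative name):

__int64 FileSizeOld(const wchar_t* name)
{
    HANDLE hFile = CreateFile(name, GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return -1;

    DWORD high = 0;
    DWORD low = GetFileSize(hFile, &high);   // high-order part comes back through the pointer
    if (low == INVALID_FILE_SIZE && GetLastError() != NO_ERROR)
    {
        CloseHandle(hFile);
        return -1;
    }
    CloseHandle(hFile);
    return ((__int64)high << 32) | low;
}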

Related

std::ifstream issue when running outside of IDE

I have a function that works fine when running inside of the Visual Studio debugging environment (with both the Debug and Release configurations), but when running the app outside of the IDE, just as an end-user would do, the program crashes. This happens with both the Debug and Release builds.
I'm aware of the differences that can exist between the Debug and Release configurations (optimizations, debug symbols, etc) and at least somewhat aware of the differences between running an app inside Visual Studio versus outside of it (debug heap, working directory, etc). I've looked at several of these things and none seem to address the issue. This is actually my first time posting to SO; normally I can find the solution from existing posts so I'm truly stumped!
I am able to attach a debugger and oddly enough I get two different error messages, based on whether I'm running the app on Windows 7 versus Windows 8.1. For Windows 7, the error is simply an access violation and it breaks right on the return statement. For Windows 8.1, it is a heap corruption error and it breaks on the construction of std::ifstream. In both cases, all of the local variables are populated correctly so I know it is not a matter of the function not being able to find the file or read its contents into the buffer data.
Also interestingly, the issue happens only about 20% of the time on Windows 8.1 and 100% of the time on Windows 7, though this may have something to do with the vastly different hardware these OS's are running on.
I'm not sure it makes any difference but the project type is a Win32 Desktop App and it initializes DirectX 11. You'll notice that the file type is interpreted as binary, which is correct as this function is primarily loading compiled shaders.
Here is the static member function LoadFile:
HRESULT MyClass::LoadFile(_In_ const CHAR* filename, _Out_ BYTE** data, _Out_ SIZE_T* length)
{
    CHAR pwd[MAX_PATH];
    GetCurrentDirectoryA(MAX_PATH, pwd);

    std::string fullFilePath = std::string(pwd) + "\\" + filename;
    std::ifstream file(fullFilePath, std::ifstream::binary);
    if (file)
    {
        file.seekg(0, file.end);
        *length = (SIZE_T)file.tellg();
        file.seekg(0, file.beg);

        *data = new BYTE[*length];
        file.read(reinterpret_cast<CHAR*>(*data), *length);
        if (file) return S_OK;
    }
    return E_FAIL;
}
UPDATE:
Interestingly, if I allocate std::ifstream file on the heap and do not delete it, the issue goes away. There must be something about the destruction of ifstream that is causing an issue in my case.
You don't check the return value of GetCurrentDirectoryA - maybe your current directory name is too long or something?
If you are already using Win32 (not portable!), use GetFileSize to get the file size rather than seeking
Better yet, use boost to write portable code
Switch on all warnings in compiler options
Enable ios exceptions (a small combined sketch follows below)
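For illustration only, a sketch that combines the first and last suggestions (checked GetCurrentDirectoryA, stream exceptions enabled); it is not claimed to fix the crash, and LoadFileChecked is just an illustrative name:

#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>
#include <windows.h>

// Sketch: fail loudly instead of silently returning E_FAIL.
std::vector<BYTE> LoadFileChecked(const char* filename)
{
    char pwd[MAX_PATH];
    DWORD len = GetCurrentDirectoryA(MAX_PATH, pwd);
    if (len == 0 || len >= MAX_PATH)
        throw std::runtime_error("GetCurrentDirectoryA failed or path too long");

    std::ifstream file(std::string(pwd) + "\\" + filename, std::ifstream::binary);
    file.exceptions(std::ifstream::failbit | std::ifstream::badbit); // throws if the open already failed

    file.seekg(0, std::ifstream::end);
    std::vector<BYTE> data(static_cast<size_t>(file.tellg()));
    file.seekg(0, std::ifstream::beg);
    file.read(reinterpret_cast<char*>(data.data()), static_cast<std::streamsize>(data.size()));
    return data;
}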
Okay, I gave up on trying to use ifstream. Apparently I'm not the only one that has this issue...just search "ifstream destructor crash".
Since this app is based on DirectX and will only be run on Windows, I went the Windows API route and everything works perfectly.
Working code, in case anyone cares:
HRESULT MyClass::LoadFile(_In_ const CHAR* filename, _Out_ BYTE** data, _Out_ SIZE_T* length)
{
    CHAR pwd[MAX_PATH];
    GetCurrentDirectoryA(MAX_PATH, pwd);
    string fullFilePath = string(pwd) + "\\" + filename;

    WIN32_FIND_DATAA fileData;
    HANDLE hFind = FindFirstFileA(fullFilePath.c_str(), &fileData);
    if (hFind == INVALID_HANDLE_VALUE) return E_FAIL;
    FindClose(hFind); // only the size info is needed; don't leak the search handle

    HANDLE file = CreateFileA(fullFilePath.c_str(),
                              GENERIC_READ,
                              FILE_SHARE_READ,
                              NULL,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL,
                              NULL);
    if (file == INVALID_HANDLE_VALUE) return E_FAIL;

    *length = (SIZE_T)fileData.nFileSizeLow;
    *data = new BYTE[*length];

    DWORD bytesRead;
    if (ReadFile(file, *data, *length, &bytesRead, NULL) == FALSE || bytesRead != *length)
    {
        delete[] *data;
        *length = 0;
        CloseHandle(file);
        return E_FAIL;
    }

    CloseHandle(file);
    return S_OK;
}

What is the fastest way to read a file on disk in C++?

I am writing a program to check whether a file is a PE file or not. For that, I need to read only the file headers (which I guess do not occupy more than the first 1024 bytes of a file).
I tried using a CreateFile() + ReadFile() combination, which turns out to be slow because I am iterating through all the files on the system drive. It is taking 15-20 minutes just to iterate through them.
Can you please suggest an alternate approach to open and read the files that would make this faster?
Note: Please note that I do NOT need to read the whole file. I just need to read the initial part of the file -- the DOS header, PE header etc., which I guess do not occupy more than the first 512 bytes of the file.
Here is my code :
bool IsPEFile(const String filePath)
{
    HANDLE hFile = CreateFile(filePath.c_str(),
                              GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL,
                              NULL);
    DWORD dwBytesRead = 0;
    const DWORD CHUNK_SIZE = 2048;
    BYTE szBuffer[CHUNK_SIZE] = {0};
    LONGLONG size;
    LARGE_INTEGER li = {0};

    if (hFile != INVALID_HANDLE_VALUE)
    {
        if (GetFileSizeEx(hFile, &li) && li.QuadPart > 0)
        {
            size = li.QuadPart;
            ReadFile(hFile, szBuffer, CHUNK_SIZE, &dwBytesRead, NULL);
            // WORDPTR/DWORDPTR are presumably helper macros reading a WORD/DWORD at the given buffer position
            if (dwBytesRead > 0 && (WORDPTR(szBuffer[0]) == ('M' << 8) + 'Z' || WORDPTR(szBuffer[0]) == ('Z' << 8) + 'M'))
            {
                LONGLONG ne_pe_header = DWORDPTR(szBuffer[0x3c]);
                WORD signature = 0;
                if (ne_pe_header <= dwBytesRead - 2)
                {
                    signature = WORDPTR(szBuffer[ne_pe_header]);
                }
                else if (ne_pe_header < size)
                {
                    SetFilePointer(hFile, ne_pe_header, NULL, FILE_BEGIN);
                    ReadFile(hFile, &signature, sizeof(signature), &dwBytesRead, NULL);
                    if (dwBytesRead != sizeof(signature))
                    {
                        CloseHandle(hFile); // don't leak the handle on the error path
                        return false;
                    }
                }
                if (signature == 0x4550) // PE file
                {
                    CloseHandle(hFile);
                    return true;
                }
            }
        }
        CloseHandle(hFile);
    }
    return false;
}
Thanks in advance.
I think you're hitting the inherent limitations of mechanical hard disk drives. You didn't mention whether you're using a HDD or a solid-state disk, but I assume a HDD given that your file accesses are slow.
HDDs can read data at about 100 MB/s sequentially, but seek time is a bit over 10 ms. This means that if you seek to a certain location (10 ms), you might as well read a megabyte of data (another 10 ms). This also means that you can access only less than 100 files per second.
So, in your case it doesn't matter much whether you're reading the first 512 bytes of a file or the first hundred kilobytes of a file.
Hardware is cheap, programmer time is expensive. Your best bet is to purchase a solid-state disk drive if your file accesses are too slow. I predict that eventually all computers will have solid-state disk drives.
Note: if the bottleneck is the HDD, there is nothing you can do about it other than to replace the HDD with better technology. Practically all file access mechanisms are equally slow. The only thing you can do about it is to read only the initial part of a file if the file is really really large such as multiple megabytes. But based on your code example you're already doing that.
For faster file I/O, you need to use the CreateFile and ReadFile APIs of Win32.
If you want to speed things up, you can tune file buffering and make the reads non-blocking by using overlapped I/O or IOCP.
See this example for help: https://msdn.microsoft.com/en-us/library/windows/desktop/bb540534%28v=vs.85%29.aspx
And I don't think C's FILE or C++'s fstream are faster than Win32.
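As a rough illustration of the overlapped approach (an assumption about the general pattern, not a copy of the linked sample; it reads only the first few hundred bytes of one file):

#include <windows.h>

// Sketch: event-based overlapped read of the first 'size' bytes of a file.
// Real code would issue many such reads before waiting on completions.
bool ReadHeader(const wchar_t* path, BYTE* buffer, DWORD size)
{
    HANDLE hFile = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return false;

    OVERLAPPED ov = {0};                               // Offset 0 = start of file
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    DWORD bytesRead = 0;
    BOOL ok = ReadFile(hFile, buffer, size, NULL, &ov);
    if (!ok && GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(ov.hEvent);
        CloseHandle(hFile);
        return false;
    }
    ok = GetOverlappedResult(hFile, &ov, &bytesRead, TRUE); // wait for completion

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    return ok && bytesRead > 0;
}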

FILE_FLAG_NO_BUFFERING slows down synchronous read operation

Reading a set of files with no buffering (skipping the file cache), by opening them with the FILE_FLAG_NO_BUFFERING flag, should be faster than normal reading (without the flag). The reason it should be faster is that the 'no buffering' mechanism skips the system file cache and reads directly into the application's buffer.
The application is run cold (after disk defragmentation and a machine restart) so that the system file cache does not already hold the files concerned.
This is from the MSDN documentation on these APIs and flags.
However, I experience totally different performance behavior. I read a set of files synchronously, one after the other, after creating the file handles with the FILE_FLAG_NO_BUFFERING flag. Reading the set of files takes 29 seconds, whereas reading them normally without the flag (again on a cold run, when the file cache does not hold the files concerned) takes around 24 seconds.
Details:
Total number of files: 1939
Total file size (sum of all): 57 MB
With FILE_FLAG_NO_BUFFERING: 29 secs (time taken to read)
Without FILE_FLAG_NO_BUFFERING: 24 secs (time taken to read)
Here is the code that implements the read:
DWORD ReadFiles(std::vector<std::string> &filePathNameVectorRef)
{
    long totalBytesRead = 0;
    for (const std::string &file : filePathNameVectorRef)
        totalBytesRead += Read_Synchronous(file.c_str());
    return totalBytesRead;
}

DWORD Read_Synchronous(const char * filePathName)
{
    DWORD accessMode = GENERIC_READ;
    DWORD shareMode = 0;
    DWORD createDisposition = OPEN_EXISTING;
    DWORD flags = FILE_FLAG_NO_BUFFERING;
    HANDLE handle = INVALID_HANDLE_VALUE;
    DWORD fileSize;
    DWORD bytesRead = 0;
    DWORD bytesToRead = 0;
    LARGE_INTEGER li;
    char * buffer;
    BOOL success = false;

    handle = CreateFile(filePathName, accessMode, shareMode, NULL, createDisposition, flags, NULL);
    if (handle == INVALID_HANDLE_VALUE)
        return 0;

    GetFileSizeEx(handle, &li);
    fileSize = (DWORD)li.QuadPart;
    bytesToRead = (fileSize / g_bytesPerPhysicalSector) * g_bytesPerPhysicalSector;

    buffer = static_cast<char *>(VirtualAlloc(0, bytesToRead, MEM_COMMIT, PAGE_READWRITE));
    if (buffer == NULL)
        goto RETURN;

    success = ReadFile(handle, buffer, bytesToRead, &bytesRead, NULL);
    if (!success)
    {
        fprintf(stdout, "\n Error occurred: %lu", GetLastError());
        bytesRead = 0;
    }

    VirtualFree(buffer, 0, MEM_RELEASE); // memory from VirtualAlloc must be released with VirtualFree, not free

RETURN:
    CloseHandle(handle);
    return bytesRead;
}
Please share your thoughts on why this code runs slower than when FILE_FLAG_NO_BUFFERING is not used. Thanks.
I expect that what you are measuring is the time to open and close the files. There are rather a lot of files. You should be able to read 57 MB from a disk in around one second, so the overhead would appear to be the file opening rather than the reading. You should try again with fewer, but larger, files: create, say, 20 files of 100 MB each and read those. It looks like, on your system at least, it is slower to open files with FILE_FLAG_NO_BUFFERING than without.
In any case, don't expect FILE_FLAG_NO_BUFFERING to speed things up. The time spent copying from the file handle's buffer to your buffer is trivial in comparison to pulling the data off the disk.
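If it helps, a minimal sketch of separating the open cost from the read cost (path and buffer are placeholders, timing via QueryPerformanceCounter):

#include <windows.h>
#include <stdio.h>

// Sketch: time CreateFile separately from ReadFile/CloseHandle to see which
// part of the per-file cost dominates.
void TimeOpenVsRead(const char* path, char* buffer, DWORD bufSize)
{
    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    QueryPerformanceCounter(&t1);

    DWORD read = 0;
    if (h != INVALID_HANDLE_VALUE)
    {
        ReadFile(h, buffer, bufSize, &read, NULL);
        CloseHandle(h);
    }
    QueryPerformanceCounter(&t2);

    printf("open: %.3f ms, read+close: %.3f ms\n",
           1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart,
           1000.0 * (t2.QuadPart - t1.QuadPart) / freq.QuadPart);
}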

Faster way to get File Size information C++

I have a function to get the size of a file. I am running this on WinCE. Here is my current code, which seems particularly slow:
int Directory::GetFileSize(const std::string &filepath)
{
    int filesize = -1;
#ifdef linux
    struct stat fileStats;
    if (stat(filepath.c_str(), &fileStats) != -1)
        filesize = fileStats.st_size;
#else
    std::wstring widePath;
    Unicode::AnsiToUnicode(widePath, filepath);
    HANDLE hFile = CreateFile(widePath.c_str(), 0, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (hFile != INVALID_HANDLE_VALUE) // CreateFile returns INVALID_HANDLE_VALUE, not 0, on failure
    {
        filesize = ::GetFileSize(hFile, NULL);
        CloseHandle(hFile);
    }
#endif
    return filesize;
}
At least for Windows, I think I'd use something like this:
__int64 Directory::GetFileSize(std::wstring const &path) {
    WIN32_FIND_DATAW data;
    HANDLE h = FindFirstFileW(path.c_str(), &data);
    if (h == INVALID_HANDLE_VALUE)
        return -1;
    FindClose(h);
    return data.nFileSizeLow | (__int64)data.nFileSizeHigh << 32;
}
If the compiler you're using supports it, you might want to use long long instead of __int64. You probably do not want to use int though, as that will only work correctly for files up to 2 gigabytes, and files larger than that are now pretty common (though perhaps not so common on a WinCE device).
I'd expect this to be faster than most other methods though. It doesn't require opening the file itself at all, just finding the file's directory entry (or, in the case of something like NTFS, its master file table entry).
Your solution is already rather fast to query the size of a file.
Under Windows, at least for NTFS and FAT, the file system driver will keep the file size in the cache, so it is rather fast to query it. The most time-consuming work involved is switching from user-mode to kernel-mode, rather than the file system driver's processing.
If you want to make it even faster, you have to use your own cache policy in user-mode, e.g. a special hash table, to avoid switching from user-mode to kernel-mode. But I don't recommend you to do that, because you will gain little performance.
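For illustration only, such a cache might look like this (it assumes file sizes do not change while the program runs, and it simply wraps the question's Directory::GetFileSize; adjust if that is an instance method):

#include <string>
#include <unordered_map>

// Sketch: remember previously queried sizes so repeated lookups of the same
// path avoid the user-mode to kernel-mode transition entirely.
class FileSizeCache
{
public:
    int Get(const std::string &filepath)
    {
        std::unordered_map<std::string, int>::iterator it = cache_.find(filepath);
        if (it != cache_.end())
            return it->second;                        // cached, no kernel call
        int size = Directory::GetFileSize(filepath);  // the function from the question
        cache_[filepath] = size;
        return size;
    }
private:
    std::unordered_map<std::string, int> cache_;
};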
PS: You'd better avoid the statement Unicode::AnsiToUnicode(widePath, filepath); in your function body. This function is rather time-consuming.
Just an idea (I haven't tested it), but I would expect GetFileAttributesEx to be fastest at the system level. It avoids having to open the file, and logically, I would expect it to be faster than FindFirstFile, since it doesn't have to maintain any information for continuing the search.
You could roll your own but I don't see why your approach is slow:
int Get_Size( string path )
{
    // needs <cstdio> for fopen_s/fseek/ftell
    FILE *pFile = NULL;
    // get the file stream
    fopen_s( &pFile, path.c_str(), "rb" );
    if ( pFile == NULL )
        return -1; // could not open the file
    // set the file pointer to end of file
    fseek( pFile, 0, SEEK_END );
    // get the file size (note: ftell returns long, so this caps out at 2 GB)
    int Size = ftell( pFile );
    // return the file pointer to begin of file if you want to read it
    // rewind( pFile );
    // close stream and release buffer
    fclose( pFile );
    return Size;
}

Reading large binary files in small parts

I want to read a file from the hard disk that can be up to ~4-5 GB in size, not all at once but in sequential parts of ~100 MB. I want to make it as simple and fast as possible, but now I see that the standard C++ methods will not work for files bigger than 2 GB.
I use Visual Studio 2008 and C++/CLI. Any suggestions? I have tried CreateFile and ReadFile, but they cause more problems than they solve for me, or I am using them wrong for reading a big file in parts.
EDIT: Sample code:
Creating handle
hFile = CreateFile(result,
                   GENERIC_READ,
                   FILE_SHARE_READ,
                   NULL,
                   OPEN_EXISTING,
                   FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED,
                   0);
Reading
lpOverlapped = new OVERLAPPED;
lpOverlapped->hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
lpOverlapped->Offset = 10;
lpOverlapped->OffsetHigh = 0;

DWORD howMuchWasRead;
BOOLEAN error = false;

do {
    this->lastError = NO_ERROR;
    BOOL bRet = ReadFile(this->hFile, this->fileBuffer, this->currentBufferSize, &howMuchWasRead, lpOverlapped);
    this->lastError = GetLastError();
    if (this->lastError == ERROR_IO_PENDING) {
        while (!HasOverlappedIoCompleted(this->lpOverlapped)) {}
        error = true;
    } else {
        error = false;
    }
} while (error == true);
This version now returns ERROR_INVALID_PARAMETER 87 (0x57) for a 4 GB .iso file; the buffer size is 100 MB.
You can map parts of the file into the address space of your process using CreateFile, CreateFileMapping and MapViewOfFile.
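A sketch of that approach, mapping roughly 100 MB views one after another (error handling trimmed; view offsets must be multiples of the 64 KB allocation granularity, which 100 MB is):

#include <windows.h>

// Sketch: process a large file through successive ~100 MB read-only views.
void ProcessLargeFile(const wchar_t* path)
{
    HANDLE hFile = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return;

    LARGE_INTEGER size = {0};
    GetFileSizeEx(hFile, &size);
    HANDLE hMap = CreateFileMappingW(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hMap)
    {
        const ULONGLONG chunk = 100ULL * 1024 * 1024;   // 100 MB, a multiple of 64 KB
        for (ULONGLONG offset = 0; offset < (ULONGLONG)size.QuadPart; offset += chunk)
        {
            ULONGLONG remaining = (ULONGLONG)size.QuadPart - offset;
            SIZE_T view = (SIZE_T)(remaining < chunk ? remaining : chunk);
            BYTE* p = (BYTE*)MapViewOfFile(hMap, FILE_MAP_READ,
                                           (DWORD)(offset >> 32), (DWORD)offset, view);
            if (!p) break;
            // ... process 'view' bytes starting at p ...
            UnmapViewOfFile(p);
        }
        CloseHandle(hMap);
    }
    CloseHandle(hFile);
}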
You can read the file sequentially without any problems.
The limitation is that fseek uses a long parameter for the offset when you want to seek. If you don't reposition in the file, or the offset is always less than 2 GB, there is no problem.
ReadFile will handle files larger than 2GB, maybe you can rephrase your question so we can help you figure out the problems you are having with that.
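For reference, a minimal sketch of the plain synchronous route: read ~100 MB at a time with ReadFile, without FILE_FLAG_NO_BUFFERING, so there are no alignment requirements and the 64-bit file position simply advances past 2 GB on its own:

#include <windows.h>
#include <vector>

// Sketch: read a large file sequentially in 100 MB chunks.
bool ReadInChunks(const wchar_t* path)
{
    HANDLE hFile = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return false;

    std::vector<char> buffer(100 * 1024 * 1024);   // 100 MB working buffer
    DWORD bytesRead = 0;
    for (;;)
    {
        if (!ReadFile(hFile, &buffer[0], (DWORD)buffer.size(), &bytesRead, NULL))
        {
            CloseHandle(hFile);
            return false;
        }
        if (bytesRead == 0)    // end of file reached
            break;
        // ... process bytesRead bytes from buffer ...
    }
    CloseHandle(hFile);
    return true;
}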