I have a function to get a FileSize of a file. I am running this on WinCE. Here is my current code which seems particularily slow
int Directory::GetFileSize(const std::string &filepath)
{
int filesize = -1;
#ifdef linux
struct stat fileStats;
if(stat(filepath.c_str(), &fileStats) != -1)
filesize = fileStats.st_size;
#else
std::wstring widePath;
Unicode::AnsiToUnicode(widePath, filepath);
HANDLE hFile = CreateFile(widePath.c_str(), 0, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
if (hFile > 0)
{
filesize = ::GetFileSize( hFile, NULL);
}
CloseHandle(hFile);
#endif
return filesize;
}
At least for Windows, I think I'd use something like this:
__int64 Directory::GetFileSize(std::wstring const &path) {
WIN32_FIND_DATAW data;
HANDLE h = FindFirstFileW(path.c_str(), &data);
if (h == INVALID_HANDLE_VALUE)
return -1;
FindClose(h);
return data.nFileSizeLow | (__int64)data.nFileSizeHigh << 32;
}
If the compiler you're using supports it, you might want to use long long instead of __int64. You probably do not want to use int though, as that will only work correctly for files up to 2 gigabytes, and files larger than that are now pretty common (though perhaps not so common on a WinCE device).
I'd expect this to be faster than most other methods though. It doesn't require opening the file itself at all, just finding the file's directory entry (or, in the case of something like NTFS, its master file table entry).
Your solution is already rather fast to query the size of a file.
Under Windows, at least for NTFS and FAT, the file system driver will keep the file size in the cache, so it is rather fast to query it. The most time-consuming work involved is switching from user-mode to kernel-mode, rather than the file system driver's processing.
If you want to make it even faster, you have to use your own cache policy in user-mode, e.g. a special hash table, to avoid switching from user-mode to kernel-mode. But I don't recommend you to do that, because you will gain little performance.
PS: You'd better avoid the statement Unicode::AnsiToUnicode(widePath, filepath); in your function body. This function is rather time-consuming.
Just an idea (I haven't tested it), but I would expect
GetFileAttributesEx to be fastest at the system level. It
avoids having to open the file, and logically, I would expect it
to be faster than FindFirstFile, since it doesn't have to
maintain any information for continuing the search.
You could roll your own but I don't see why your approach is slow:
int Get_Size( string path )
{
// #include <fstream>
FILE *pFile = NULL;
// get the file stream
fopen_s( &pFile, path.c_str(), "rb" );
// set the file pointer to end of file
fseek( pFile, 0, SEEK_END );
// get the file size
int Size = ftell( pFile );
// return the file pointer to begin of file if you want to read it
// rewind( pFile );
// close stream and release buffer
fclose( pFile );
return Size;
}
Related
I have a big .txt file (over 1gb). While searching a way to open it fast I found mapping.
I managed to use CreateFile(), then I made a char buffer[] and finally put the file contents in the buffer with ReadFile(). The problem is that the file is too big, so I can't load it all at once into the buffer, because I can't make an array that big.
I think the solution would be to open and close the file at specified locations in the .txt file and get a few of the file contents each time. The only source I found explaining mapping was on MSDN but I can't find out how to do it.
So in the end, how do I read a big file with a mapping?
HANDLE my_File = CreateFileA("words.txt", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (my_File == INVALID_HANDLE_VALUE)
{
cout << "Failed to open file" << endl;
return 0;
}
constexpr size_t BUFFSIZE = 1000000;
char buffer[BUFFSIZE];
DWORD dwBytesToRead = BUFFSIZE - 1;
DWORD dwBytesRead = 0;
BOOL my_Bool = ReadFile(my_File,(void*)buffer, dwBytesToRead, &dwBytesRead, NULL);
if (dwBytesRead > 0)
{
buffer[dwBytesRead] = '\0';
cout << "FILE IS: " << buffer << endl;
}
CloseHandle(my_File);
I think you are confused. The whole purpose of mapping part or all of a file into memory is to avoid the need to buffer the data yourself. Instead, the OS takes care of that for you, allowing you to access the contents of the file via a pointer, just like you would any other in-memory data structure.
Only you can decide if that's the best solution for you. In a 32 bit app, 1GB is a lot of addressing space to find. In a 64 bit app there is no such problem. As mentioned in the comments, reading the file in chunks into a smaller buffer can be a better bet, especially if you want to process it sequentially.
For some example code on how to memory map a file, see:
How to CreateFileMapping in C++?
I am porting over c++ code from linux to windows. I am currently using Visual Studio 2013 to port my code.
I need to read a binary file and am using this portion of c++ code:
// Open the stream
std::ifstream is("myfile.bin");
// Determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size=is.tellg();
is.seekg(0, std::ios_base::begin);
// Create a vector to store the data
int* Data = new int[size/sizeof(int)];
// Load the data
is.read((char*) &Data[0], size);
// Close the file
is.close();
In linux, the size of my binary file is correctly found to be 744mb. However, in windows, the size of my binary file is incorrectly found to be >4GB. How can I correct this issue?
Change std::ifstream is("myfile.bin"); to std::ifstream is("myfile.bin", std::ios::binary);
With your current default open mode, the compiler choses "char" mode. In Linux chars in files are UTF8, first 128 positions are 1-byte char. But for memory UTF32, 4-bytes per char, is used. In Windows chars are "wide-chars", 2-bytes per char.
I finally had the time to actually run this myself, though I had to fix a couple of things, like ios_base::beg instead of begin (different function) Also, as mentioned, the array allocation should be this int* Data = new int[size / sizeof(int) + 1]; // At most one extra int
I found your problem: you're not in the right directory. Check if you successfully opened the file or not. If you don't, then you get a huge garbage value (probably -1, but unsigned, so massive) for size.
Try this to find your directory in Windows: (probably need Windows.h or something that I "just had" already)
char dirBuf[256];
GetCurrentDirectory(256, dirBuf);
cout << "Current directory is: " << dirBuf << endl;
See if that's where your file is and move it accordingly. Or specify the ENTIRE path in the constructor to ifstream.
Also, it has nothing to do with ios::binary or not. Works fine both ways, or fails if the file isn't there.
std::size_t size=is.tellg();
The standard doesn't require tellg to return the byte offset from the beginning of the file. In general, this may not be a reliable way to get the size of the file, though it probably does what you expect on Linux and Windows.
The return type of the tellg method is std::basic_stream::pos_type, so you're starting with an implicit conversion to std::size_t which may or may not be appropriate. In a 32-bit build, for example, it's conceivable that the size of a file could be larger than a std::size_t can represent.
But the root problem is that you're not checking for errors. If you have exceptions disabled, then tellg reports an error by returning pos_type(-1). When you cast that to an unsigned type (which std::size_t is), then you get a very large value. I suspect you failed to open the file, and since you didn't detect that error, the seekg and the tellg failed. You then coerced pos_type(-1) to a std::size_t, which made it look like the file was huge.
You also have the problems others have noted: failing to open the file in binary mode and computing the wrong size for the buffer when the file isn't a multiple of the size of an int.
The most reliable to get the file size is to use the OS's API. On Windows, you can do this instead:
// Open the file. [TODO: Get the file name in wide characters and use
// CreateFileW instead. If the file name contains characters not
// representable by the user's ANSI codepage, then CreateFileA will fail.]
HANDLE hfile = CreateFileA("myfile.bin", GENERIC_READ, FILE_SHARE_READ,
nullptr, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
nullptr);
if (hfile == INVALID_HANDLE_VALUE) { error handling here }
// Figure out how big it is.
LARGE_INTEGER li_size;
if (!GetFileSizeEx(hfile, &li_size)) { error handling here }
// TODO: On a 32-bit build, this won't be able to handle huge files,
// so check that here.
std::size_t size = li_size.QuadPart;
// Create a buffer to store the data, being careful to round up to a
// multiple of sizeof(int). [TODO: Use a std::vector instead.]
int* Data = new int[(size + sizeof(int) - 1) / sizeof(int)];
// Load the data.
const DWORD BytesToRead = static_cast<DWORD>(size);
DWORD BytesRead = 0;
if (!ReadFile(hfile, Data, &BytesRead, nullptr) || BytesRead < BytesToRead) {
error handling here
}
// Close the file
CloseHandle(hfile);
int* Data = new int[size/sizeof(int)];
Why are you doing this? You're dividing the size by 4. You don't want to do this. It should just be int* Data = new int[size]
Also, it should be std::ifstream f("filename.bin", std::ios::binary);
I am writing a program to check whether a file is PE file or not. For that, I need to read only the file headers of files(which I guess do not occupy more than first 1024 bytes of a file).
I tried using creatfile() + readfile() combination which turns out be slower because I am iterating through all the files in system drive. It is taking 15-20 minutes just to iterate through them.
Can you please tell some alternate approach to open and read the files to make it faster?
Note : Please note that I do NOT need to read the file in whole. I just need to read the initial part of the file -- DOS header, PE header etc which I guess do not occupy more than first 512 bytes of the file.
Here is my code :
bool IsPEFile(const String filePath)
{
HANDLE hFile = CreateFile(filePath.c_str(),
GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL);
DWORD dwBytesRead = 0;
const DWORD CHUNK_SIZE = 2048;
BYTE szBuffer[CHUNK_SIZE] = {0};
LONGLONG size;
LARGE_INTEGER li = {0};
if (hFile != INVALID_HANDLE_VALUE)
{
if(GetFileSizeEx(hFile, &li) && li.QuadPart > 0)
{
size = li.QuadPart;
ReadFile(hFile, szBuffer, CHUNK_SIZE, &dwBytesRead, NULL);
if(dwBytesRead > 0 && (WORDPTR(szBuffer[0]) == ('M' << 8) + 'Z' || WORDPTR(szBuffer[0]) == ('Z' << 8) + 'M'))
{
LONGLONG ne_pe_header = DWORDPTR(szBuffer[0x3c]);
WORD signature = 0;
if(ne_pe_header <= dwBytesRead-2)
{
signature = WORDPTR(szBuffer[ne_pe_header]);
}
else if (ne_pe_header < size )
{
SetFilePointer(hFile, ne_pe_header, NULL, FILE_BEGIN);
ReadFile(hFile, &signature, sizeof(signature), &dwBytesRead, NULL);
if (dwBytesRead != sizeof(signature))
{
return false;
}
}
if(signature == 0x4550) // PE file
{
return true;
}
}
}
CloseHandle(hFile);
}
return false;
}
Thanks in advance.
I think you're hitting the inherent limitations of mechanical hard disk drives. You didn't mention whether you're using a HDD or a solid-state disk, but I assume a HDD given that your file accesses are slow.
HDDs can read data at about 100 MB/s sequentially, but seek time is a bit over 10 ms. This means that if you seek to a certain location (10 ms), you might as well read a megabyte of data (another 10 ms). This also means that you can access only less than 100 files per second.
So, in your case it doesn't matter much whether you're reading the first 512 bytes of a file or the first hundred kilobytes of a file.
Hardware is cheap, programmer time is expensive. Your best bet is to purchase a solid-state disk drive if your file accesses are too slow. I predict that eventually all computers will have solid-state disk drives.
Note: if the bottleneck is the HDD, there is nothing you can do about it other than to replace the HDD with better technology. Practically all file access mechanisms are equally slow. The only thing you can do about it is to read only the initial part of a file if the file is really really large such as multiple megabytes. But based on your code example you're already doing that.
For faster file IO, you need to use CreateFile and ReadFile APIs of Win32.
If you want to speed up, you can use file buffering and make file non-blocking by using overlapped IO or IOCP.
See this example for help: https://msdn.microsoft.com/en-us/library/windows/desktop/bb540534%28v=vs.85%29.aspx
And I think that FILE and fstream of C and C++ respectively are not faster than Win32.
I am trying to figure out a (hopefully easy) way to read a large, unstructured file without bumping into the edge of a buffer. An example is helpful here.
Imagine you are trying to do some data-recovery of a 16GB flash-drive and have saved a dump of the drive to a 16GB file. You want to scan through the image, looking for certain items of interest. If the file were smaller, you could read the entire thing into a memory buffer (let’s say 1MB) and do a simple scan through the buffer. However, because it is too big to read in all at once, you need to read it in chunks. The problem is that an item of interest may not be perfectly aligned so as to fall within a single 1MB buffer. In other words, it may end up straddling the edge of the buffer so that it starts at the end of the buffer during one read, and ends in the next one (or even further).
At one time in the past, I dealt with this by using two buffers and copying the second one to the first one to create a sort of sliding window, however I imagine that this should be a common enough scenario that there are better, existing solutions. I looked into memory-mapped files, thinking that they let you read the file by simply increasing the array index/pointer, but I ended up in the exact same situation as before due to the limit of the map view size. I tried looking for some practical examples of using MapViewOfFile with offsets, but all I could find were contrived examples that skipped that.
How is this situation normally handled?
If you are running in a 64 bit environment, I would just use memory mapped files. There is no (reasonable) memory limit for a process. You can read the file in, even jump around, and the OS will swap memory to and from disk.
Here's some basic information:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
And an example of a file viewer here:
http://www.catch22.net/tuts/memory-techniques-part-1
This case works on a 2.8GB file in x64, but fails in win32 because it cannot allocate more than 2GB per process. It is very fast since it touches only the first and last byte in the pBuf array. Modifying the method to traverse the buffer and count the number of 'zero' bytes works as expected. You can watch the memory footprint go up as it does it but that memory is only virtually allocated.
#include "stdafx.h"
#include <string>
#include <Windows.h>
TCHAR szName[] = TEXT( pathToFile );
int _tmain(int argc, _TCHAR* argv[])
{
HANDLE hMapFile;
char* pBuf;
HANDLE file = CreateFile( szName, GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
if ( file == NULL )
{
_tprintf(TEXT("Could not open file object (%d).\n"),
GetLastError());
return 1;
}
unsigned int length = GetFileSize(file, 0);
printf( "Length = %u\n", length );
hMapFile = CreateFileMapping( file, 0, PAGE_READONLY, 0, 0, 0 );
if (hMapFile == NULL)
{
_tprintf(TEXT("Could not create file mapping object (%d).\n"), GetLastError());
return 1;
}
pBuf = (char*) MapViewOfFile(hMapFile, FILE_MAP_READ, 0,0, length);
if (pBuf == NULL)
{
_tprintf(TEXT("Could not map view of file (%d).\n"), GetLastError());
CloseHandle(hMapFile);
return 1;
}
printf("First Byte: 0x%02x\n", pBuf[0] );
printf("Last Byte: 0x%02x\n", pBuf[length-1] );
UnmapViewOfFile(pBuf);
CloseHandle(hMapFile);
return 0;
}
I'm trying to get the filesize of a large file (12gb+) and I don't want to open the file to do so as I assume this would eat a lot of resources. Is there any good API to do so with? I'm in a Windows environment.
You should call GetFileSizeEx which is easier to use than the older GetFileSize. You will need to open the file by calling CreateFile but that's a cheap operation. Your assumption that opening a file is expensive, even a 12GB file, is false.
You could use the following function to get the job done:
__int64 FileSize(const wchar_t* name)
{
HANDLE hFile = CreateFile(name, GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile==INVALID_HANDLE_VALUE)
return -1; // error condition, could call GetLastError to find out more
LARGE_INTEGER size;
if (!GetFileSizeEx(hFile, &size))
{
CloseHandle(hFile);
return -1; // error condition, could call GetLastError to find out more
}
CloseHandle(hFile);
return size.QuadPart;
}
There are other API calls that will return you the file size without forcing you to create a file handle, notably GetFileAttributesEx. However, it's perfectly plausible that this function will just open the file behind the scenes.
__int64 FileSize(const wchar_t* name)
{
WIN32_FILE_ATTRIBUTE_DATA fad;
if (!GetFileAttributesEx(name, GetFileExInfoStandard, &fad))
return -1; // error condition, could call GetLastError to find out more
LARGE_INTEGER size;
size.HighPart = fad.nFileSizeHigh;
size.LowPart = fad.nFileSizeLow;
return size.QuadPart;
}
If you are compiling with Visual Studio and want to avoid calling Win32 APIs then you can use _wstat64.
Here is a _wstat64 based version of the function:
__int64 FileSize(const wchar_t* name)
{
__stat64 buf;
if (_wstat64(name, &buf) != 0)
return -1; // error, could use errno to find out more
return buf.st_size;
}
If performance ever became an issue for you then you should time the various options on all the platforms that you target in order to reach a decision. Don't assume that the APIs that don't require you to call CreateFile will be faster. They might be but you won't know until you have timed it.
I've also lived with the fear of the price paid for opening a file and closing it just to get its size. And decided to ask the performance counter^ and see how expensive the operations really are.
This is the number of cycles it took to execute 1 file size query on the same file with the three methods. Tested on 2 files: 150 MB and 1.5 GB. Got +/- 10% fluctuations so they don't seem to be affected by actual file size. (obviously this depend on CPU but it gives you a good vantage point)
190 cycles - CreateFile, GetFileSizeEx, CloseHandle
40 cycles - GetFileAttributesEx
150 cycles - FindFirstFile, FindClose
The GIST with the code used^ is available here.
As we can see from this highly scientific :) test, slowest is actually the file opener. 2nd slowest is the file finder while the winner is the attributes reader. Now, in terms of reliability, CreateFile should be preferred over the other 2. But I still don't like the concept of opening a file just to read its size... Unless I'm doing size critical stuff, I'll go for the Attributes.
PS: When I'll have time I'll try to read sizes of files that are opened and am writing to. But not right now...
Another option using the FindFirstFile function
#include "stdafx.h"
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
int _tmain(int argc, _TCHAR* argv[])
{
WIN32_FIND_DATA FindFileData;
HANDLE hFind;
LPCTSTR lpFileName = L"C:\\Foo\\Bar.ext";
hFind = FindFirstFile(lpFileName , &FindFileData);
if (hFind == INVALID_HANDLE_VALUE)
{
printf ("File not found (%d)\n", GetLastError());
return -1;
}
else
{
ULONGLONG FileSize = FindFileData.nFileSizeHigh;
FileSize <<= sizeof( FindFileData.nFileSizeHigh ) * 8;
FileSize |= FindFileData.nFileSizeLow;
_tprintf (TEXT("file size is %u\n"), FileSize);
FindClose(hFind);
}
return 0;
}
As of C++17, there is file_size as part of the standard library. (Then the implementor gets to decide how to do it efficiently!)
What about GetFileSize function?