I have a big .txt file (over 1 GB). While searching for a way to open it quickly, I found file mapping.
I managed to use CreateFile(), then I made a char buffer[] and finally put the file contents in the buffer with ReadFile(). The problem is that the file is too big, so I can't load it all at once into the buffer, because I can't make an array that big.
I think the solution would be to open and close the file at specified locations in the .txt file and read part of the file contents each time. The only source I found explaining mapping was on MSDN, but I can't work out how to do it from that.
So in the end, how do I read a big file with a mapping?
HANDLE my_File = CreateFileA("words.txt", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (my_File == INVALID_HANDLE_VALUE)
{
cout << "Failed to open file" << endl;
return 0;
}
constexpr size_t BUFFSIZE = 1000000;
static char buffer[BUFFSIZE]; // static storage: close to 1 MB is too much for the default 1 MB stack
DWORD dwBytesToRead = BUFFSIZE - 1;
DWORD dwBytesRead = 0;
BOOL my_Bool = ReadFile(my_File,(void*)buffer, dwBytesToRead, &dwBytesRead, NULL);
if (dwBytesRead > 0)
{
buffer[dwBytesRead] = '\0';
cout << "FILE IS: " << buffer << endl;
}
CloseHandle(my_File);
I think you are confused. The whole purpose of mapping part or all of a file into memory is to avoid the need to buffer the data yourself. Instead, the OS takes care of that for you, allowing you to access the contents of the file via a pointer, just like you would any other in-memory data structure.
Only you can decide if that's the best solution for you. In a 32 bit app, 1GB is a lot of addressing space to find. In a 64 bit app there is no such problem. As mentioned in the comments, reading the file in chunks into a smaller buffer can be a better bet, especially if you want to process it sequentially.
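To illustrate the chunked alternative mentioned above, here is a minimal sketch that walks a file through a small heap buffer. It uses standard C++ streams rather than ReadFile so that it stays portable; countNewlines and chunkSize are names made up for this example, and counting newlines is just a stand-in for whatever processing you actually need.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Reads the file at `path` in fixed-size chunks and returns how many newline
// characters it contains. Returns -1 if the file cannot be opened or if
// chunkSize is zero. (Hypothetical helper, for illustration only.)
std::int64_t countNewlines(const std::string& path, std::size_t chunkSize = 1 << 20)
{
    if (chunkSize == 0) return -1;
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;

    std::vector<char> buffer(chunkSize);   // lives on the heap, not the stack
    std::int64_t lines = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount()); // short on the last chunk
        for (std::size_t i = 0; i < got; ++i)
            if (buffer[i] == '\n') ++lines;
    }
    return lines;
}
```

Memory use stays at one chunk regardless of the file size, which is why this tends to suit sequential processing in a 32 bit process.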
For some example code on how to memory map a file, see:
How to CreateFileMapping in C++?
I've searched MSDN but did not find any information about sharing the same HANDLE between WriteFile and ReadFile. NOTE: I did not use the CREATE_ALWAYS flag, so there is no chance of the file being replaced with an empty file.
The reason I tried to use the same HANDLE is performance. My code basically downloads some data (writes it to a file), reads it back immediately, then deletes it.
As I understand it, a file HANDLE is just a memory address that also serves as the entry point for doing I/O.
This is how the error occurs:
CreateFile(OK) --> WriteFile(OK) --> GetFileSize(OK) --> ReadFile(Failed) --> CloseHandle(OK)
If WriteFile is called synchronously, the following ReadFile should have no problem; even GetFileSize after WriteFile returns the correct value (the new, modified file size)! But in fact ReadFile behaves as if it saw the file before the modification (lpNumberOfBytesRead always has the old value). Then a thought came to mind: caching!
So I tried to learn more about Windows file caching, which I knew nothing about. I even tried the FILE_FLAG_NO_BUFFERING flag and the FlushFileBuffers function, but no luck. Of course I know I can call CloseHandle and CreateFile again between WriteFile and ReadFile; I just wonder whether there is a way to achieve this without calling CreateFile again.
Above is the gist of my question; below is the demo code I made for this concept:
int main()
{
HANDLE hFile = CreateFile(L"C://temp//TEST.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL| FILE_FLAG_WRITE_THROUGH, NULL);
//step one write 12345 to file
std::string test = "12345";
char * pszOutBuffer;
pszOutBuffer = (char*)malloc(strlen(test.c_str()) + 1); //create buffer for 12345 plus a null terminator
ZeroMemory(pszOutBuffer, strlen(test.c_str()) + 1); //zero the whole buffer so it ends with a null terminator
memcpy(pszOutBuffer, test.c_str(), strlen(test.c_str())); //copy 12345 to buffer
DWORD wmWritten;
WriteFile(hFile, pszOutBuffer, strlen(test.c_str()), &wmWritten, NULL); //write 12345 to file
//according to msdn this refresh the buffer
FlushFileBuffers(hFile);
std::cout << "bytes writen to file(num):"<< wmWritten << std::endl; //got output 5 here as expected, 5 bytes have been written to file.
//step two getfilesize and read file
//get file size of C://temp//TEST.txt
DWORD dwFileSize = 0;
dwFileSize = GetFileSize(hFile, NULL);
if (dwFileSize == INVALID_FILE_SIZE)
{
return -1; //unable to get filesize
}
std::cout << "GetFileSize result is:" << dwFileSize << std::endl; //got output 5 here as expected
char * bufFstream;
bufFstream = (char*)malloc(sizeof(char)*(dwFileSize + 1)); //create buffer with filesize & a null terminator
memset(bufFstream, 0, sizeof(char)*(dwFileSize + 1));
std::cout << "created a buffer for ReadFile with size:" << dwFileSize + 1 << std::endl; //got output 6 as expected here
if (bufFstream == NULL) {
return -1;//ERROR_MEMORY;
}
DWORD nRead = 0;
bool bBufResult = ReadFile(hFile, bufFstream, dwFileSize, &nRead, NULL); //dwFileSize is 5 here
if (!bBufResult) {
free(bufFstream);
return -1; //copy file into buffer failed
}
std::cout << "nRead is:" << nRead << std::endl; //!!!got nRead 0 here!!!? why?
CloseHandle(hFile);
free(pszOutBuffer);
free(bufFstream);
return 0;
}
then the output is:
bytes writen to file(num):5
GetFileSize result is:5
created a buffer for ReadFile with size:6
nRead is:0
nRead should be 5 not 0.
Win32 file handles have a single file pointer, shared by reads and writes; after the WriteFile it is at the end of the file, so if you then try to read, ReadFile finds nothing left and reports zero bytes read. To read what you just wrote you have to reposition the file pointer at the start of the file, using the SetFilePointer function.
Also, the FlushFileBuffers call isn't needed - the operating system ensures that reads and writes on the file handle see the same state, regardless of the status of the buffers.
After the first write, the file cursor points at the end of the file, so there is nothing to read. You can rewind it to the beginning using SetFilePointer:
::DWORD const result(::SetFilePointer(hFile, 0, nullptr, FILE_BEGIN));
if(INVALID_SET_FILE_POINTER == result)
{
::DWORD const last_error(::GetLastError());
if(NO_ERROR != last_error)
{
// TODO do error handling...
}
}
When you try to read the file, from what position are you reading?
A FILE_OBJECT maintains a "current" position (the CurrentByteOffset member), which is used as the default position when you read or write the file (for synchronous files only, i.e. those opened without FILE_FLAG_OVERLAPPED!). This position is updated (moved n bytes forward) after every read or write of n bytes.
The best solution is to always pass an explicit file offset to ReadFile (or WriteFile). The offset goes in the last parameter, OVERLAPPED lpOverlapped: look at the Offset and OffsetHigh members. The read operation starts at the offset specified in the OVERLAPPED structure.
This is more efficient and simpler than making a separate API call to SetFilePointer, which adjusts the CurrentByteOffset member of the FILE_OBJECT (and which does not work for asynchronous file handles, i.e. those created with the FILE_FLAG_OVERLAPPED flag).
Despite a very common confusion, OVERLAPPED is not only for asynchronous I/O: it is simply an additional parameter to ReadFile (or WriteFile) and can always be used, for any file handle.
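A minimal sketch of the explicit-offset approach described above. The helper names splitOffset and readAt are made up for this example; ReadFile, OVERLAPPED, Offset and OffsetHigh are the real Win32 names. The Win32 part is guarded so the portable helper can be compiled anywhere, and it assumes a handle opened without FILE_FLAG_OVERLAPPED.

```cpp
#include <cstdint>
#include <utility>

// Splits a 64-bit file offset into the (low, high) 32-bit halves that the
// OVERLAPPED structure stores in its Offset and OffsetHigh members.
std::pair<std::uint32_t, std::uint32_t> splitOffset(std::uint64_t offset)
{
    return { static_cast<std::uint32_t>(offset & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(offset >> 32) };
}

#ifdef _WIN32
#include <windows.h>

// Reads up to `size` bytes starting at `offset`, regardless of where the
// handle's own file pointer currently is. Hypothetical helper; assumes a
// synchronous handle, so the call returns only when the read has finished.
bool readAt(HANDLE hFile, std::uint64_t offset, void* buffer,
            DWORD size, DWORD* bytesRead)
{
    const auto parts = splitOffset(offset);
    OVERLAPPED ov = {};            // hEvent stays NULL for a synchronous handle
    ov.Offset     = parts.first;   // low 32 bits of the starting position
    ov.OffsetHigh = parts.second;  // high 32 bits of the starting position
    return ReadFile(hFile, buffer, size, bytesRead, &ov) != FALSE;
}
#endif
```

In the demo from the question, a call along the lines of readAt(hFile, 0, bufFstream, dwFileSize, &nRead) right after WriteFile would start at the first byte instead of at the end of the file.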
I am writing a program to check whether a file is PE file or not. For that, I need to read only the file headers of files(which I guess do not occupy more than first 1024 bytes of a file).
I tried using creatfile() + readfile() combination which turns out be slower because I am iterating through all the files in system drive. It is taking 15-20 minutes just to iterate through them.
Can you please tell some alternate approach to open and read the files to make it faster?
Note : Please note that I do NOT need to read the file in whole. I just need to read the initial part of the file -- DOS header, PE header etc which I guess do not occupy more than first 512 bytes of the file.
Here is my code :
bool IsPEFile(const String filePath)
{
HANDLE hFile = CreateFile(filePath.c_str(),
GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL);
DWORD dwBytesRead = 0;
const DWORD CHUNK_SIZE = 2048;
BYTE szBuffer[CHUNK_SIZE] = {0};
LONGLONG size;
LARGE_INTEGER li = {0};
if (hFile != INVALID_HANDLE_VALUE)
{
if(GetFileSizeEx(hFile, &li) && li.QuadPart > 0)
{
size = li.QuadPart;
ReadFile(hFile, szBuffer, CHUNK_SIZE, &dwBytesRead, NULL);
if(dwBytesRead > 0 && (WORDPTR(szBuffer[0]) == ('M' << 8) + 'Z' || WORDPTR(szBuffer[0]) == ('Z' << 8) + 'M'))
{
LONGLONG ne_pe_header = DWORDPTR(szBuffer[0x3c]);
WORD signature = 0;
if(ne_pe_header <= dwBytesRead-2)
{
signature = WORDPTR(szBuffer[ne_pe_header]);
}
else if (ne_pe_header < size )
{
SetFilePointer(hFile, ne_pe_header, NULL, FILE_BEGIN);
ReadFile(hFile, &signature, sizeof(signature), &dwBytesRead, NULL);
if (dwBytesRead != sizeof(signature))
{
CloseHandle(hFile); // close the handle before the early return, otherwise it leaks
return false;
}
}
if(signature == 0x4550) // PE file
{
CloseHandle(hFile); // close the handle before the early return, otherwise it leaks
return true;
}
}
}
CloseHandle(hFile);
}
return false;
}
Thanks in advance.
I think you're hitting the inherent limitations of mechanical hard disk drives. You didn't mention whether you're using an HDD or a solid-state disk, but I assume an HDD given that your file accesses are slow.
HDDs can read data at about 100 MB/s sequentially, but seek time is a bit over 10 ms. This means that once you have seeked to a certain location (10 ms), you might as well read a megabyte of data (another 10 ms). It also means that you can access fewer than 100 files per second.
So, in your case it doesn't matter much whether you're reading the first 512 bytes of a file or the first hundred kilobytes of a file.
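To put numbers on that, here is a back-of-the-envelope sketch using the round figures from above (10 ms per seek, 100 MB/s sequential). The function name and the one-seek-per-file model are simplifying assumptions for illustration, not measurements.

```cpp
// Estimated files per second when each file costs one seek plus a
// sequential read of `bytesPerFile` bytes. (Simplified model: one seek
// per file, no caching, no directory or metadata lookups.)
double filesPerSecond(double seekSeconds, double bytesPerSecond, double bytesPerFile)
{
    return 1.0 / (seekSeconds + bytesPerFile / bytesPerSecond);
}
```

With filesPerSecond(0.010, 100e6, 512) the model gives just under 100 files per second, and with 100 KB per file (100e3 bytes) it gives about 91. The seek dominates either way, so shrinking the read barely helps.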
Hardware is cheap, programmer time is expensive. Your best bet is to purchase a solid-state disk drive if your file accesses are too slow. I predict that eventually all computers will have solid-state disk drives.
Note: if the bottleneck is the HDD, there is nothing you can do about it other than to replace the HDD with better technology. Practically all file access mechanisms are equally slow. The only thing you can do about it is to read only the initial part of a file if the file is really really large such as multiple megabytes. But based on your code example you're already doing that.
For faster file IO, you need to use CreateFile and ReadFile APIs of Win32.
If you want to speed things up further, you can use file buffering and make the file access non-blocking by using overlapped I/O or IOCP.
See this example for help: https://msdn.microsoft.com/en-us/library/windows/desktop/bb540534%28v=vs.85%29.aspx
And I think that FILE and fstream of C and C++ respectively are not faster than Win32.
I am trying to figure out a (hopefully easy) way to read a large, unstructured file without bumping into the edge of a buffer. An example is helpful here.
Imagine you are trying to do some data-recovery of a 16GB flash-drive and have saved a dump of the drive to a 16GB file. You want to scan through the image, looking for certain items of interest. If the file were smaller, you could read the entire thing into a memory buffer (let’s say 1MB) and do a simple scan through the buffer. However, because it is too big to read in all at once, you need to read it in chunks. The problem is that an item of interest may not be perfectly aligned so as to fall within a single 1MB buffer. In other words, it may end up straddling the edge of the buffer so that it starts at the end of the buffer during one read, and ends in the next one (or even further).
At one time in the past, I dealt with this by using two buffers and copying the second one to the first one to create a sort of sliding window, however I imagine that this should be a common enough scenario that there are better, existing solutions. I looked into memory-mapped files, thinking that they let you read the file by simply increasing the array index/pointer, but I ended up in the exact same situation as before due to the limit of the map view size. I tried looking for some practical examples of using MapViewOfFile with offsets, but all I could find were contrived examples that skipped that.
How is this situation normally handled?
If you are running in a 64 bit environment, I would just use memory mapped files. There is no (reasonable) memory limit for a process. You can read the file in, even jump around, and the OS will swap memory to and from disk.
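If the whole file cannot be mapped at once (for instance in a 32 bit process), you can map one window at a time. The catch is that the offset passed to MapViewOfFile (dwFileOffsetHigh/dwFileOffsetLow) must be a multiple of the allocation granularity reported by GetSystemInfo in dwAllocationGranularity. Below is a minimal sketch of that alignment arithmetic; ViewWindow and alignWindow are names invented for this example, and 65536 is only an assumed default, so real code should query the actual value.

```cpp
#include <algorithm>
#include <cstdint>

struct ViewWindow {
    std::uint64_t mapOffset;  // aligned offset to hand to MapViewOfFile
    std::uint64_t mapLength;  // number of bytes to map (dwNumberOfBytesToMap)
    std::uint64_t delta;      // index of the wanted byte inside the mapped view
};

// Computes an aligned window covering [wanted, wanted + length), clipped to
// the end of the file. Returns an all-zero window for invalid input.
ViewWindow alignWindow(std::uint64_t wanted, std::uint64_t length,
                       std::uint64_t fileSize, std::uint64_t granularity = 65536)
{
    ViewWindow w{};
    if (granularity == 0 || wanted >= fileSize) return w;
    w.mapOffset = wanted - (wanted % granularity);   // round down to the granularity
    w.delta     = wanted - w.mapOffset;
    w.mapLength = std::min(w.delta + length, fileSize - w.mapOffset);
    return w;
}
```

After mapping, the byte you asked for sits at view[delta]. Asking for a length somewhat larger than your step size makes consecutive windows overlap, which is one way to handle items that straddle a window boundary.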
Here's some basic information:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
And an example of a file viewer here:
http://www.catch22.net/tuts/memory-techniques-part-1
This case works on a 2.8GB file in x64, but fails in win32 because it cannot allocate more than 2GB per process. It is very fast since it touches only the first and last byte in the pBuf array. Modifying the method to traverse the buffer and count the number of 'zero' bytes works as expected. You can watch the memory footprint go up as it does it but that memory is only virtually allocated.
#include "stdafx.h"
#include <string>
#include <Windows.h>
TCHAR szName[] = TEXT( pathToFile );
int _tmain(int argc, _TCHAR* argv[])
{
HANDLE hMapFile;
char* pBuf;
HANDLE file = CreateFile( szName, GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
if ( file == INVALID_HANDLE_VALUE ) // CreateFile reports failure with INVALID_HANDLE_VALUE, not NULL
{
_tprintf(TEXT("Could not open file object (%d).\n"),
GetLastError());
return 1;
}
unsigned int length = GetFileSize(file, 0);
printf( "Length = %u\n", length );
hMapFile = CreateFileMapping( file, 0, PAGE_READONLY, 0, 0, 0 );
if (hMapFile == NULL)
{
_tprintf(TEXT("Could not create file mapping object (%d).\n"), GetLastError());
return 1;
}
pBuf = (char*) MapViewOfFile(hMapFile, FILE_MAP_READ, 0,0, length);
if (pBuf == NULL)
{
_tprintf(TEXT("Could not map view of file (%d).\n"), GetLastError());
CloseHandle(hMapFile);
return 1;
}
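printf("First Byte: 0x%02x\n", (unsigned char) pBuf[0] ); // cast avoids sign extension for bytes >= 0x80
printf("Last Byte: 0x%02x\n", (unsigned char) pBuf[length-1] );
UnmapViewOfFile(pBuf);
CloseHandle(hMapFile);
CloseHandle(file); // the file handle must be closed as well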
return 0;
}
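For the case where you do have to read in chunks (the original question), the usual trick is to carry the last few bytes of each chunk over into the next one, so an item straddling a boundary is still seen whole. Here is a minimal portable sketch using standard streams instead of a mapping; countOccurrences is a name made up for this example, and a fixed byte pattern stands in for the "item of interest".

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Counts occurrences of `pattern` in the file, including ones that straddle
// chunk boundaries. Returns -1 on bad arguments or if the file cannot be opened.
std::int64_t countOccurrences(const std::string& path, const std::string& pattern,
                              std::size_t chunkSize = 1 << 20)
{
    if (pattern.empty() || chunkSize == 0) return -1;
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;

    const std::size_t keep = pattern.size() - 1;  // bytes carried across reads
    std::string window;                           // carried tail + current chunk
    std::vector<char> chunk(chunkSize);
    std::int64_t hits = 0;

    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        window.append(chunk.data(), got);

        for (std::size_t pos = window.find(pattern); pos != std::string::npos;
             pos = window.find(pattern, pos + 1))
            ++hits;

        // Keep only the last `keep` bytes: too short to contain a whole match,
        // so nothing is counted twice, yet enough to complete a straddling one.
        if (window.size() > keep) window.erase(0, window.size() - keep);
    }
    return hits;
}
```

The carried tail is one byte shorter than the pattern, which is what guarantees each match is counted exactly once. For variable-length items you would carry the maximum item length minus one instead.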
I need help reading data off of the last cluster of a file using CreateFile() and then using ReadFile(). First I'm stuck with a zero result for my ReadFile() because I think I have incorrect permissions set up in CreateFile().
/**********CreateFile for volume ********/
HANDLE hDevice = INVALID_HANDLE_VALUE;
hDevice = CreateFile(L"\\\\.\\C:",
0,
FILE_SHARE_READ |
FILE_SHARE_WRITE,
NULL,
OPEN_EXISTING,
0,
NULL);
if (hDevice == INVALID_HANDLE_VALUE)
{
wcout << "error at hDevice at CreateFile "<< endl;
system("pause");
}
/******* Read file from the volume *********/
DWORD nRead;
TCHAR buff[4096];
if (BOOL fileFromVol = ReadFile(
hDevice,
buff,
4096,
&nRead,
NULL
) == 0) {
cout << "Error with fileFromVol" << "\n\n";
system("pause");
}
Next, I have all the cluster information and file information I need (file size, last cluster location of the file, # of clusters on disk, cluster size, etc). How do I set the pointer on the volume to start at a specified cluster location so I can read/write data from it?
The main problem is that you specify 0 for dwDesiredAccess. In order to read the data you should specify FILE_READ_DATA.
On top of that I seriously question the use of TCHAR. That's appropriate for text when you need to support Windows 9x. On top of not needing to support Windows 9x, the data is not text. Your buffer should be of type unsigned char.
Obviously you need the buffer to be a multiple of the cluster size. You've hard coded 4096, but the real code should surely query the cluster size.
When either of these API calls fails, it indicates the failure reason in the last error value. You can obtain that by calling GetLastError. When your ReadFile fails, GetLastError will return ERROR_ACCESS_DENIED.
You can seek in the volume by calling SetFilePointerEx. Again, you will need to seek to multiples of the cluster size.
LARGE_INTEGER dist;
dist.QuadPart = static_cast<LONGLONG>(ClusterNum) * ClusterSize; // widen first so the product cannot overflow 32 bits
BOOL res = SetFilePointerEx(hFile, dist, nullptr, FILE_BEGIN);
if (!res)
// handle error
If you are reading sequentially then there's no need to set the file pointer. The call to ReadFile will advance it automatically.
When doing random-access I/O, just don't mess with the file pointer stored in the file handle at all. Instead, use an OVERLAPPED structure and specify the location for each and every I/O operation.
This works even for synchronous I/O (if the file is opened without FILE_FLAG_OVERLAPPED).
Of course, as David mentioned you will get ERROR_ACCESS_DENIED if you perform operations using a file handle opened without sufficient access.
I have a function to get the size of a file. I am running this on WinCE. Here is my current code, which seems particularly slow:
int Directory::GetFileSize(const std::string &filepath)
{
int filesize = -1;
#ifdef linux
struct stat fileStats;
if(stat(filepath.c_str(), &fileStats) != -1)
filesize = fileStats.st_size;
#else
std::wstring widePath;
Unicode::AnsiToUnicode(widePath, filepath);
HANDLE hFile = CreateFile(widePath.c_str(), 0, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
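if (hFile != INVALID_HANDLE_VALUE) // CreateFile signals failure with INVALID_HANDLE_VALUE
{
filesize = ::GetFileSize( hFile, NULL);
CloseHandle(hFile); // only close a handle that was actually opened
}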
#endif
return filesize;
}
At least for Windows, I think I'd use something like this:
__int64 Directory::GetFileSize(std::wstring const &path) {
WIN32_FIND_DATAW data;
HANDLE h = FindFirstFileW(path.c_str(), &data);
if (h == INVALID_HANDLE_VALUE)
return -1;
FindClose(h);
return data.nFileSizeLow | (__int64)data.nFileSizeHigh << 32;
}
If the compiler you're using supports it, you might want to use long long instead of __int64. You probably do not want to use int though, as that will only work correctly for files up to 2 gigabytes, and files larger than that are now pretty common (though perhaps not so common on a WinCE device).
I'd expect this to be faster than most other methods though. It doesn't require opening the file itself at all, just finding the file's directory entry (or, in the case of something like NTFS, its master file table entry).
Your solution is already rather fast to query the size of a file.
Under Windows, at least for NTFS and FAT, the file system driver will keep the file size in the cache, so it is rather fast to query it. The most time-consuming work involved is switching from user-mode to kernel-mode, rather than the file system driver's processing.
If you want to make it even faster, you have to use your own cache policy in user-mode, e.g. a special hash table, to avoid switching from user-mode to kernel-mode. But I don't recommend you to do that, because you will gain little performance.
PS: You'd better avoid the statement Unicode::AnsiToUnicode(widePath, filepath); in your function body. This function is rather time-consuming.
Just an idea (I haven't tested it), but I would expect GetFileAttributesEx to be fastest at the system level. It avoids having to open the file, and logically, I would expect it to be faster than FindFirstFile, since it doesn't have to maintain any information for continuing the search.
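A minimal sketch of what that could look like, equally untested on WinCE. combineSize and fileSizeByAttributes are names made up for this example; GetFileAttributesExW, GetFileExInfoStandard and WIN32_FILE_ATTRIBUTE_DATA are the real Win32 names. The Win32 part is guarded so the portable helper compiles anywhere.

```cpp
#include <cstdint>

// Combines the two 32-bit halves reported by Win32 into one 64-bit size.
std::int64_t combineSize(std::uint32_t high, std::uint32_t low)
{
    return static_cast<std::int64_t>((static_cast<std::uint64_t>(high) << 32) | low);
}

#ifdef _WIN32
#include <windows.h>

// Returns the size of the file at `path`, or -1 on failure.
// Hypothetical helper; the file itself is never opened.
std::int64_t fileSizeByAttributes(const wchar_t* path)
{
    WIN32_FILE_ATTRIBUTE_DATA data;
    if (!GetFileAttributesExW(path, GetFileExInfoStandard, &data))
        return -1;   // call GetLastError() here for the reason
    return combineSize(data.nFileSizeHigh, data.nFileSizeLow);
}
#endif
```

Whether this actually beats FindFirstFile on a given device is something you would have to measure.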
You could roll your own but I don't see why your approach is slow:
int Get_Size( string path )
{
// requires #include <cstdio> and <string>
FILE *pFile = NULL;
// get the file stream; bail out if the file cannot be opened
if ( fopen_s( &pFile, path.c_str(), "rb" ) != 0 || pFile == NULL )
return -1;
// set the file pointer to end of file
fseek( pFile, 0, SEEK_END );
// get the file size
int Size = ftell( pFile );
// return the file pointer to begin of file if you want to read it
// rewind( pFile );
// close stream and release buffer
fclose( pFile );
return Size;
}