Reading large binary files in small parts - C++

I want to read a file from the hard disk that can be up to ~4-5 GB in size, not all at once but in sequential parts of ~100 MB. I want to make it as simple and fast as possible, but now I see that the standard C++ methods will not work for files bigger than 2 GB.
I use Visual Studio 2008, C++/CLI. Any suggestions? I have tried CreateFile and ReadFile, but they cause me more problems than they solve, or I am using them wrongly for reading a big file in parts.
EDIT: Sample code:
Creating the handle:
hFile = CreateFile(result,
                   GENERIC_READ,
                   FILE_SHARE_READ,
                   NULL,
                   OPEN_EXISTING,
                   FILE_ATTRIBUTE_NORMAL
                       | FILE_FLAG_NO_BUFFERING
                       | FILE_FLAG_OVERLAPPED,
                   0);
Reading:
lpOverlapped = new OVERLAPPED;
lpOverlapped->hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
lpOverlapped->Offset = 10;
lpOverlapped->OffsetHigh = 0;
DWORD howMuchWasRead;
BOOLEAN error = false;
do {
    this->lastError = NO_ERROR;
    BOOL bRet = ReadFile(this->hFile, this->fileBuffer, this->currentBufferSize, &howMuchWasRead, lpOverlapped);
    this->lastError = GetLastError();
    if (this->lastError == ERROR_IO_PENDING) {
        while (!HasOverlappedIoCompleted(this->lpOverlapped)) {}
        error = true;
    } else {
        error = false;
    }
} while (error == true);
This version now returns ERROR_INVALID_PARAMETER 87 (0x57) for a 4 GB .iso file; the buffer size is 100 MB.

You can map parts of the file into the address space of your process using CreateFile, CreateFileMapping and MapViewOfFile.
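For example, a minimal sketch of mapping a ~100 MB window of a large file at a 64-bit offset (the file name and sizes here are illustrative assumptions; note that the view offset must be a multiple of the system allocation granularity, typically 64 KB):
#include <windows.h>

// Sketch only: map a ~100 MB read-only view at a 64-bit offset.
HANDLE hFile = CreateFile(L"D:\\big.iso", GENERIC_READ, FILE_SHARE_READ,
                          NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);

ULONGLONG offset = 0x40000000ULL;        // must be a multiple of the
                                         // allocation granularity (64 KB)
SIZE_T viewSize = 100 * 1024 * 1024;     // ~100 MB window
const BYTE* view = (const BYTE*)MapViewOfFile(hMapping, FILE_MAP_READ,
                                              (DWORD)(offset >> 32),
                                              (DWORD)(offset & 0xFFFFFFFF),
                                              viewSize);
if (view) {
    // ... read from view[0..viewSize-1] ...
    UnmapViewOfFile(view);
}
CloseHandle(hMapping);
CloseHandle(hFile);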

You can read the file sequentially without any problems.
The limitation is that fseek takes a long parameter for the offset when you want to seek. If you don't reposition in the file, or the offset is always less than 2 GB, there is no problem.
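On MSVC there is also _fseeki64, which takes a 64-bit offset; a minimal sketch, with an illustrative file name:
#include <cstdio>

// Sketch: seeking past 2 GB with MSVC's 64-bit fseek variant.
FILE* f = fopen("D:\\big.iso", "rb");
if (f) {
    __int64 offset = 3LL * 1024 * 1024 * 1024;    // 3 GB, too big for a long
    if (_fseeki64(f, offset, SEEK_SET) == 0) {
        char buf[4096];
        size_t n = fread(buf, 1, sizeof(buf), f); // read a chunk at that offset
        (void)n;
    }
    fclose(f);
}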

ReadFile will handle files larger than 2 GB; maybe you can rephrase your question so we can help you figure out the problems you are having with it.
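For reference, a minimal sketch of reading a file larger than 2 GB in 100 MB chunks with plain synchronous ReadFile, using the OVERLAPPED structure only to carry the 64-bit offset (the file name and chunk size are assumptions):
#include <windows.h>
#include <vector>

// Sketch: sequential 100 MB chunks from a large file; synchronous I/O.
HANDLE h = CreateFile(L"D:\\big.iso", GENERIC_READ, FILE_SHARE_READ,
                      NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (h != INVALID_HANDLE_VALUE) {
    const DWORD chunk = 100 * 1024 * 1024;
    std::vector<BYTE> buffer(chunk);
    ULONGLONG offset = 0;
    for (;;) {
        OVERLAPPED ov = {0};                 // carries the 64-bit file offset
        ov.Offset     = (DWORD)(offset & 0xFFFFFFFF);
        ov.OffsetHigh = (DWORD)(offset >> 32);
        DWORD read = 0;
        if (!ReadFile(h, buffer.data(), chunk, &read, &ov) || read == 0)
            break;                           // error or end of file
        // ... process buffer[0..read-1] ...
        offset += read;
    }
    CloseHandle(h);
}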

Related

InternetReadFile filling buffer, but returning zero bytes read

I'm having a very strange problem whilst trying to download a file from the internet inside a C++ application written for Windows Compact 2013.
BOOL WWW::Read(char* buffer, DWORD buffer_size)
{
    memset(buffer, 0, buffer_size);
    m_dwBytesRead = 0;
    BOOL bResult = InternetReadFile(m_handle, buffer, buffer_size, &m_dwBytesRead);
    if (!bResult)
    {
        DWORD dwLastError = GetLastError();
        TCHAR *err;
        if (FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM,
                          NULL, dwLastError,
                          MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), // default language
                          (LPTSTR)&err, 0, NULL))
        {
            LOGMSG(1, (TEXT("InternetReadFile failed at - (%u) %s\r\n"), dwLastError, err));
            LocalFree(err);
        }
    }
    // FUDGE
    if (m_dwBytesRead == 0)
    {
        DWORD dwZeros = countZeros(buffer, buffer_size);
        if (dwZeros < buffer_size)
        {
            m_dwBytesRead = buffer_size;
        }
    }
    // END OF FUDGE
    return bResult;
}
I repeatedly call the above function from another member function as follows:
DWORD dwWritten;
while (!(Read(buffer, DOWNLOAD_BUFFER_SIZE) && m_dwBytesRead == 0))
{
    WriteFile(m_hDownload, buffer, m_dwBytesRead, &dwWritten, NULL);
    m_dwActualSize += dwWritten;
    ++m_dwChunks;
    if (m_dwBytesRead > 0)
        m_dwInactivity = 0;
    else if (++m_dwInactivity > INACTIVITY_LIMIT)
        return WDS_INACTIVITY;
}
Without the FUDGE, this function fails the first time through, and works correctly on subsequent calls. The error that I get on the first pass through this function call is:
InternetReadFile failed at - (112) There is not enough space on the disk.
I don't understand why I should be getting a "not enough space on disk" error during a READ operation. I have checked that the buffer is allocated, available, and matches the expected size. In fact, when I inspect the contents of the buffer, I find that it HAS been filled with the expected number of bytes; however, the m_dwBytesRead variable is still set to 0.
As you can see, I have tried to code around this specific case by inspecting the contents of the buffer to see if it has been filled, and then fudging the m_dwBytesRead variable, but this is only a temporary work around to get me past this error, I really need to understand why this problem is occurring.
The consequence of this error (without my fudge) is that the data is thrown away, and I end up with a file that is missing the first block but is otherwise fully correct. Consequently MD5 checks fail, and I am missing the first part of the file.
I just happen to know that the file will always be larger than the block size that I am using, so my fudge will work, but I don't like having these horrible workarounds in the code when they shouldn't be needed.
If anyone can shed any light upon what is causing the problem, it would be greatly appreciated.
I'm using Visual Studio 2013 C++ (native Windows app, not MFC), the target is 32-bit and Unicode, running on Windows Compact 2013.
Many thanks,
Andrew
Is the machine actually running out of disk space? InternetReadFile will write to disk behind your back by default:
To ensure all data is retrieved, an application must continue to call the InternetReadFile function until the function returns TRUE and the lpdwNumberOfBytesRead parameter equals zero. This is especially important if the requested data is written to the cache, because otherwise the cache will not be properly updated and the file downloaded will not be committed to the cache. Note that caching happens automatically unless the original request to open the data stream set the INTERNET_FLAG_NO_CACHE_WRITE flag.
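If the automatic cache write is the culprit, a minimal sketch of opening the stream with caching disabled (the agent name and URL are illustrative assumptions):
#include <windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib")

// Sketch: open the URL without writing the download into the WinINet cache.
HINTERNET hSession = InternetOpen(TEXT("MyAgent"), INTERNET_OPEN_TYPE_PRECONFIG,
                                  NULL, NULL, 0);
HINTERNET hUrl = InternetOpenUrl(hSession, TEXT("http://example.com/file.bin"),
                                 NULL, 0,
                                 INTERNET_FLAG_NO_CACHE_WRITE, // skip the cache
                                 0);
// ... InternetReadFile loop as before, then:
InternetCloseHandle(hUrl);
InternetCloseHandle(hSession);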

What is the fastest way to read a file on disk in C++?

I am writing a program to check whether a file is a PE file or not. For that, I need to read only the file headers, which I guess do not occupy more than the first 1024 bytes of a file.
I tried using the CreateFile() + ReadFile() combination, which turns out to be slow because I am iterating through all the files on the system drive. It takes 15-20 minutes just to iterate through them.
Can you suggest an alternate approach to open and read the files to make it faster?
Note: Please note that I do NOT need to read the whole file. I just need to read the initial part of the file -- the DOS header, PE header, etc., which I guess do not occupy more than the first 512 bytes.
Here is my code :
bool IsPEFile(const String filePath)
{
    HANDLE hFile = CreateFile(filePath.c_str(),
                              GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL,
                              NULL);
    DWORD dwBytesRead = 0;
    const DWORD CHUNK_SIZE = 2048;
    BYTE szBuffer[CHUNK_SIZE] = {0};
    LONGLONG size;
    LARGE_INTEGER li = {0};
    if (hFile != INVALID_HANDLE_VALUE)
    {
        if (GetFileSizeEx(hFile, &li) && li.QuadPart > 0)
        {
            size = li.QuadPart;
            ReadFile(hFile, szBuffer, CHUNK_SIZE, &dwBytesRead, NULL);
            // check for the "MZ" DOS signature in either byte order
            if (dwBytesRead > 0 && (WORDPTR(szBuffer[0]) == ('M' << 8) + 'Z' || WORDPTR(szBuffer[0]) == ('Z' << 8) + 'M'))
            {
                LONGLONG ne_pe_header = DWORDPTR(szBuffer[0x3c]);
                WORD signature = 0;
                if (ne_pe_header <= dwBytesRead - 2)
                {
                    signature = WORDPTR(szBuffer[ne_pe_header]);
                }
                else if (ne_pe_header < size)
                {
                    SetFilePointer(hFile, ne_pe_header, NULL, FILE_BEGIN);
                    ReadFile(hFile, &signature, sizeof(signature), &dwBytesRead, NULL);
                    if (dwBytesRead != sizeof(signature))
                    {
                        CloseHandle(hFile); // don't leak the handle on early return
                        return false;
                    }
                }
                if (signature == 0x4550) // PE file
                {
                    CloseHandle(hFile); // don't leak the handle on early return
                    return true;
                }
            }
        }
        CloseHandle(hFile);
    }
    return false;
}
Thanks in advance.
I think you're hitting the inherent limitations of mechanical hard disk drives. You didn't mention whether you're using an HDD or a solid-state disk, but I assume an HDD given that your file accesses are slow.
HDDs can read data at about 100 MB/s sequentially, but seek time is a bit over 10 ms. This means that if you seek to a certain location (10 ms), you might as well read a megabyte of data (another 10 ms). This also means that you can access fewer than 100 files per second.
So, in your case it doesn't matter much whether you're reading the first 512 bytes of a file or the first hundred kilobytes of a file.
Hardware is cheap, programmer time is expensive. Your best bet is to purchase a solid-state disk drive if your file accesses are too slow. I predict that eventually all computers will have solid-state disk drives.
Note: if the bottleneck is the HDD, there is nothing you can do about it other than replace the HDD with better technology. Practically all file access mechanisms are equally slow. The only thing you can do is read only the initial part of a file if the file is really large, such as multiple megabytes. But based on your code example you're already doing that.
For faster file IO, you need to use the CreateFile and ReadFile APIs of Win32.
If you want to speed things up, you can use file buffering and make the file non-blocking by using overlapped IO or an IOCP.
See this example for help: https://msdn.microsoft.com/en-us/library/windows/desktop/bb540534%28v=vs.85%29.aspx
And I don't think that C's FILE or C++'s fstream is faster than Win32.
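For illustration, a minimal sketch of a single overlapped read with an event to wait on (the file name and buffer size are assumptions, error handling trimmed):
#include <windows.h>

// Sketch: one overlapped read; the handle must be opened with
// FILE_FLAG_OVERLAPPED for the I/O to be asynchronous.
HANDLE h = CreateFile(L"C:\\some\\file.exe", GENERIC_READ,
                      FILE_SHARE_READ, NULL, OPEN_EXISTING,
                      FILE_FLAG_OVERLAPPED, NULL);
if (h != INVALID_HANDLE_VALUE) {
    BYTE header[1024];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    DWORD read = 0;
    if (!ReadFile(h, header, sizeof(header), &read, &ov) &&
        GetLastError() == ERROR_IO_PENDING) {
        // Do other work here, then wait for completion.
        GetOverlappedResult(h, &ov, &read, TRUE); // TRUE = block until done
    }
    // ... inspect header[0..read-1] ...
    CloseHandle(ov.hEvent);
    CloseHandle(h);
}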

Reading data off of a cluster

I need help reading data off of the last cluster of a file using CreateFile() and then using ReadFile(). First I'm stuck with a zero result for my ReadFile() because I think I have incorrect permissions set up in CreateFile().
/********** CreateFile for volume **********/
HANDLE hDevice = INVALID_HANDLE_VALUE;
hDevice = CreateFile(L"\\\\.\\C:",
                     0,
                     FILE_SHARE_READ |
                     FILE_SHARE_WRITE,
                     NULL,
                     OPEN_EXISTING,
                     0,
                     NULL);
if (hDevice == INVALID_HANDLE_VALUE)
{
    wcout << "error at hDevice at CreateFile " << endl;
    system("pause");
}
/********** Read file from the volume **********/
DWORD nRead;
TCHAR buff[4096];
if (BOOL fileFromVol = ReadFile(
        hDevice,
        buff,
        4096,
        &nRead,
        NULL) == 0)
{
    cout << "Error with fileFromVol" << "\n\n";
    system("pause");
}
Next, I have all the cluster information and file information I need (file size, last cluster location of the file, # of clusters on disk, cluster size, etc.). How do I set the pointer on the volume to start at a specified cluster location so I can read/write data from it?
The main problem is that you specify 0 for dwDesiredAccess. In order to read the data you should specify FILE_READ_DATA.
On top of that I seriously question the use of TCHAR. That's appropriate for text when you need to support Windows 9x. On top of not needing to support Windows 9x, the data is not text. Your buffer should be of type unsigned char.
Obviously you need the buffer to be a multiple of the cluster size. You've hard coded 4096, but the real code should surely query the cluster size.
When either of these API calls fails, it indicates a failure reason in the last error value. You can obtain that by calling GetLastError. When your ReadFile fails, GetLastError will return ERROR_ACCESS_DENIED.
You can seek in the volume by calling SetFilePointerEx. Again, you will need to seek to multiples of the cluster size.
LARGE_INTEGER dist;
dist.QuadPart = (LONGLONG)ClusterNum * ClusterSize; // do the multiply in 64 bits
BOOL res = SetFilePointerEx(hFile, dist, nullptr, FILE_BEGIN);
if (!res)
    ; // handle error
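As for querying the cluster size rather than hard coding 4096, a minimal sketch using GetDiskFreeSpace (the drive letter is an assumption):
#include <windows.h>

// Sketch: query the volume's cluster size instead of hard coding it.
DWORD sectorsPerCluster = 0, bytesPerSector = 0, freeClusters = 0, totalClusters = 0;
if (GetDiskFreeSpace(TEXT("C:\\"), &sectorsPerCluster, &bytesPerSector,
                     &freeClusters, &totalClusters))
{
    DWORD clusterSize = sectorsPerCluster * bytesPerSector;
    // clusterSize is the unit your reads and seeks should be multiples of
}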
If you are reading sequentially, then there's no need to set the file pointer. The call to ReadFile will advance it automatically.
When doing random-access I/O, just don't mess with the file pointer stored in the file handle at all. Instead, use an OVERLAPPED structure and specify the location for each and every I/O operation.
This works even for synchronous I/O (if the file is opened without FILE_FLAG_OVERLAPPED).
Of course, as David mentioned you will get ERROR_ACCESS_DENIED if you perform operations using a file handle opened without sufficient access.
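A minimal sketch of that pattern, assuming a synchronous handle (opened without FILE_FLAG_OVERLAPPED) and illustrative cluster variables:
#include <windows.h>

// Sketch: random-access read using OVERLAPPED only to carry the offset.
// hDevice is a synchronous handle; clusterNum/clusterSize are illustrative.
ULONGLONG byteOffset = (ULONGLONG)clusterNum * clusterSize;
OVERLAPPED ov = {0};
ov.Offset     = (DWORD)(byteOffset & 0xFFFFFFFF);
ov.OffsetHigh = (DWORD)(byteOffset >> 32);
BYTE buf[4096]; // should really be clusterSize bytes
DWORD nRead = 0;
if (!ReadFile(hDevice, buf, sizeof(buf), &nRead, &ov))
{
    // GetLastError() tells you why; ERROR_HANDLE_EOF past the end, etc.
}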

Read bytes of hard drive

Using the hex editor HxD one can read (and edit) the bytes on the hard drive, or a USB key, or the RAM. That is, one can read/change the first byte on the hard disk.
I understand how to read the bytes from a file using C++, but I was wondering how one might do this for the hard disk.
To make it simple, given a positive integer n, how can I read byte number n on the hard drive using C++? (I would like to do C++, but if there is an easier way, I would like to hear about that.)
I am using MinGW on Windows 7 if that matters.
It is documented in the MSDN Library article for CreateFile, section "Physical Disks and Volumes". This code worked well to directly read the C: drive:
HANDLE hdisk = CreateFile(L"\\\\.\\C:",
                          GENERIC_READ,
                          FILE_SHARE_READ | FILE_SHARE_WRITE,
                          nullptr,
                          OPEN_EXISTING,
                          0, NULL);
if (hdisk == INVALID_HANDLE_VALUE) {
    int err = GetLastError();
    // report error...
    return -err;
}
LARGE_INTEGER position = { 0 };
BOOL ok = SetFilePointerEx(hdisk, position, nullptr, FILE_BEGIN);
assert(ok);
BYTE buf[65536];
DWORD read;
ok = ReadFile(hdisk, buf, 65536, &read, nullptr);
assert(ok);
// etc..
Admin privileges are required: you must run your program elevated on Win7 or you'll get error 5 (access denied).
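If you would rather detect the missing privilege up front than wait for error 5, a minimal sketch that checks whether the process token is elevated (Vista and later):
#include <windows.h>

// Sketch: check whether the current process is running elevated.
BOOL IsElevated()
{
    BOOL elevated = FALSE;
    HANDLE hToken = NULL;
    if (OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &hToken)) {
        TOKEN_ELEVATION te = {0};
        DWORD cb = sizeof(te);
        if (GetTokenInformation(hToken, TokenElevation, &te, sizeof(te), &cb))
            elevated = te.TokenIsElevated != 0;
        CloseHandle(hToken);
    }
    return elevated;
}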

Faster method for exporting embedded data

For some reasons, I'm using the method described here: http://geekswithblogs.net/TechTwaddle/archive/2009/10/16/how-to-embed-an-exe-inside-another-exe-as-a.aspx
It starts off from the first byte of the embedded file and goes through 4,234,925 bytes one by one! It takes approximately 40 seconds to finish.
Are there any other methods for copying an embedded file to the hard disk? (I may be wrong here, but I think the embedded file is read from memory.)
Thanks.
Once you know the location and size of the embedded exe, you can do it in one write.
LPBYTE pbExtract; // the pointer to the data to extract
UINT cbExtract;   // the size of the data to extract
HANDLE hf;
hf = CreateFile("filename.exe",        // file name
                GENERIC_WRITE,         // open for writing
                0,                     // no share
                NULL,                  // no security
                CREATE_ALWAYS,         // overwrite existing
                FILE_ATTRIBUTE_NORMAL, // normal file
                NULL);                 // no template
if (INVALID_HANDLE_VALUE != hf)
{
    DWORD cbWrote;
    WriteFile(hf, pbExtract, cbExtract, &cbWrote, NULL);
    CloseHandle(hf);
}
As the man says, write more of the file (or the whole thing) per WriteFile call. A WriteFile call per byte is going to be ridiculously slow, yes.
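If the data is embedded as a Win32 resource, a minimal sketch of obtaining pbExtract and cbExtract before the single WriteFile (the resource name and type are assumptions):
#include <windows.h>

// Sketch: locate an embedded binary resource and get a pointer and size.
// "EMBEDDED_EXE" and RT_RCDATA are illustrative; use your actual IDs.
HMODULE hMod = GetModuleHandle(NULL);
HRSRC hRes = FindResource(hMod, TEXT("EMBEDDED_EXE"), RT_RCDATA);
if (hRes)
{
    HGLOBAL hData = LoadResource(hMod, hRes);
    DWORD cbExtract = SizeofResource(hMod, hRes);
    LPBYTE pbExtract = (LPBYTE)LockResource(hData);
    // pbExtract/cbExtract can now be written out in one WriteFile call
}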