Searching for structures in a continuous, unstructured file stream - C++

I am trying to figure out a (hopefully easy) way to read a large, unstructured file without bumping into the edge of a buffer. An example is helpful here.
Imagine you are trying to do some data recovery on a 16GB flash drive and have saved a dump of the drive to a 16GB file. You want to scan through the image, looking for certain items of interest. If the file were smaller, you could read the entire thing into a memory buffer and do a simple scan through it. However, because it is too big to read in all at once, you need to read it in chunks (let's say 1MB at a time). The problem is that an item of interest may not be aligned so that it falls entirely within a single 1MB buffer. In other words, it may end up straddling the edge of the buffer, starting near the end of the buffer during one read and ending in the next one (or even further along).
At one time in the past, I dealt with this by using two buffers and copying the second one into the first to create a sort of sliding window; however, I imagine this is a common enough scenario that there are better, existing solutions. I looked into memory-mapped files, thinking that they would let me read the file by simply increasing an array index/pointer, but I ended up in exactly the same situation as before because of the limit on the size of a map view. I tried looking for practical examples of using MapViewOfFile with offsets, but all I could find were contrived examples that skipped that part.
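For reference, this is roughly what I mean by the two-buffer / sliding-window approach. It is only a simplified, untested sketch: the chunk size is a placeholder, and maxItemLen stands for the longest item I might search for.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <vector>

// Scan a huge file in fixed-size chunks, carrying the last (maxItemLen - 1)
// bytes of each chunk over to the next read so that an item straddling a
// chunk boundary is still seen in one contiguous buffer.
void ScanFile(const char* path, std::size_t maxItemLen)
{
    const std::size_t chunkSize = 1 << 20;           // 1MB per read (placeholder)
    const std::size_t overlap   = maxItemLen - 1;    // carry-over region

    std::vector<char> buf(overlap + chunkSize);
    std::ifstream in(path, std::ios::binary);

    std::size_t carried = 0;                          // bytes carried from the previous chunk
    while (in)
    {
        in.read(buf.data() + carried, chunkSize);
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0)
            break;                                    // nothing new to scan
        std::size_t valid = carried + got;

        // ... search buf[0 .. valid) for items of interest ...
        // (an item lying wholly inside the carried region will be seen twice)

        // Copy the tail of this chunk to the front for the next iteration.
        carried = (valid > overlap) ? overlap : 0;
        if (carried)
            std::copy(buf.begin() + (valid - overlap), buf.begin() + valid, buf.begin());
    }
}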
How is this situation normally handled?

If you are running in a 64-bit environment, I would just use memory-mapped files. There is no (reasonable) limit on the address space of a process, so you can map the whole file in, even jump around in it, and the OS will page data in and out of physical memory for you.
Here's some basic information:
http://msdn.microsoft.com/en-us/library/ms810613.aspx
And an example of a file viewer here:
http://www.catch22.net/tuts/memory-techniques-part-1
This code works on a 2.8GB file under x64, but fails under Win32 because a 32-bit process cannot map that much address space (roughly 2GB is available per process). It is very fast since it touches only the first and last bytes in the pBuf array. Modifying the method to traverse the buffer and count the number of zero bytes works as expected; you can watch the memory footprint go up as it runs, but that memory is only virtually allocated.
#include "stdafx.h"
#include <string>
#include <Windows.h>
TCHAR szName[] = TEXT( pathToFile );
int _tmain(int argc, _TCHAR* argv[])
{
HANDLE hMapFile;
char* pBuf;
HANDLE file = CreateFile( szName, GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
if ( file == NULL )
{
_tprintf(TEXT("Could not open file object (%d).\n"),
GetLastError());
return 1;
}
unsigned int length = GetFileSize(file, 0);
printf( "Length = %u\n", length );
hMapFile = CreateFileMapping( file, 0, PAGE_READONLY, 0, 0, 0 );
if (hMapFile == NULL)
{
_tprintf(TEXT("Could not create file mapping object (%d).\n"), GetLastError());
return 1;
}
pBuf = (char*) MapViewOfFile(hMapFile, FILE_MAP_READ, 0,0, length);
if (pBuf == NULL)
{
_tprintf(TEXT("Could not map view of file (%d).\n"), GetLastError());
CloseHandle(hMapFile);
return 1;
}
printf("First Byte: 0x%02x\n", pBuf[0] );
printf("Last Byte: 0x%02x\n", pBuf[length-1] );
UnmapViewOfFile(pBuf);
CloseHandle(hMapFile);
return 0;
}
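If you do need to stay 32-bit, or simply want to keep the address-space footprint small, the same idea still works: map the file in smaller windows at increasing offsets instead of all at once. The offset passed to MapViewOfFile must be a multiple of the system allocation granularity (usually 64KB), which you can query with GetSystemInfo. The following is a rough, untested sketch; the window size and overlap are placeholders, and the overlap should be at least the maximum item length minus one so that anything straddling a window boundary is seen whole in the next window.
#include <windows.h>
#include <algorithm>

// Scan a large file through a sequence of overlapping mapped windows.
// hMapFile is a mapping created with CreateFileMapping over the whole file.
void ScanFileInWindows(HANDLE hMapFile, unsigned long long fileSize)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const unsigned long long granularity = si.dwAllocationGranularity;   // usually 64KB

    const unsigned long long windowSize = 64ULL * 1024 * 1024;   // 64MB per view (placeholder)
    const unsigned long long overlap    = 4096;                  // >= max item length - 1 (placeholder)

    unsigned long long pos = 0;                                  // logical scan position
    while (pos < fileSize)
    {
        // View offsets must be multiples of the allocation granularity,
        // so round the desired position down and remember the difference.
        unsigned long long mapStart = (pos / granularity) * granularity;
        unsigned long long bytes    = std::min(windowSize, fileSize - mapStart);

        const char* view = static_cast<const char*>(
            MapViewOfFile(hMapFile, FILE_MAP_READ,
                          static_cast<DWORD>(mapStart >> 32),
                          static_cast<DWORD>(mapStart & 0xFFFFFFFF),
                          static_cast<SIZE_T>(bytes)));
        if (view == NULL)
            break;                                               // real code: report GetLastError()

        // ... scan view[(pos - mapStart) .. bytes) for items of interest ...

        UnmapViewOfFile(view);

        // Advance, then back up by the overlap so an item straddling the end
        // of this window is fully contained in the next one.
        pos = mapStart + bytes;
        if (pos < fileSize)
            pos -= std::min(overlap, bytes);
    }
}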

Related

File Mapping: How to open a file (txt) from a specific location

I have a big .txt file (over 1GB). While searching for a way to open it fast, I found file mapping.
I managed to use CreateFile(), then I made a char buffer[] and finally put the file contents in the buffer with ReadFile(). The problem is that the file is too big to load all at once into the buffer, because I can't make an array that big.
I think the solution would be to open the file at specified locations and read a portion of its contents each time. The only source I found explaining mapping was on MSDN, but I can't work out how to do it from there.
So in the end, how do I read a big file with a mapping?
HANDLE my_File = CreateFileA("words.txt", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (my_File == INVALID_HANDLE_VALUE)
{
    cout << "Failed to open file" << endl;
    return 0;
}
constexpr size_t BUFFSIZE = 1000000;
char buffer[BUFFSIZE];
DWORD dwBytesToRead = BUFFSIZE - 1;
DWORD dwBytesRead = 0;
BOOL my_Bool = ReadFile(my_File, (void*)buffer, dwBytesToRead, &dwBytesRead, NULL);
if (dwBytesRead > 0)
{
    buffer[dwBytesRead] = '\0';
    cout << "FILE IS: " << buffer << endl;
}
CloseHandle(my_File);
I think you are confused. The whole purpose of mapping part or all of a file into memory is to avoid the need to buffer the data yourself. Instead, the OS takes care of that for you, allowing you to access the contents of the file via a pointer, just like you would any other in-memory data structure.
Only you can decide if that's the best solution for you. In a 32-bit app, 1GB is a lot of contiguous address space to find; in a 64-bit app there is no such problem. As mentioned in the comments, reading the file in chunks into a smaller buffer can be a better bet, especially if you want to process it sequentially.
For some example code on how to memory map a file, see:
How to CreateFileMapping in C++?

Difference Between FileMapping and Istream Binary

I have two code samples, the first is the following:
//THIS CODE READS THE NOTEPAD.EXE BINARY INTO A MEMORY BUFFER USING ISTREAM
ifstream in("notepad.exe", std::ios::binary | std::ios::ate);
int size = in.tellg();
char* buffer = new char[size];
ifstream input("notepad.exe", std::ios::binary);
input.read(buffer, size);
This is the second:
//THIS CODE GETS FILE MAPPING IMAGE OF SAME BINARY
HANDLE hFile = CreateFile("notepad.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
const char* image = (const char*) MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
My question is, what exactly is the difference between these two methods? If we are ignoring the sizing issue that is better handled by file mapping, are these two objects returned essentially the same? Will not the image variable point to essentially the same thing as the buffer variable - this being an image of the binary executable in memory? What are all the differences between the two?
The method using std::ifstream actually copies the data of the file into RAM when input.read() is called, whereas MapViewOfFile by itself doesn't touch the file data at all.
MapViewOfFile() returns a pointer, but the data will only actually be read from disk when you access the virtual memory that the pointer points to.
// This creates just a "view" of the file but doesn't read data from the file.
const char *buffer = reinterpret_cast<const char*>( MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0) );
// Only at this point the file actually gets read. The virtual memory manager now
// reads a page of 4KiB into memory.
char value = buffer[ 10 ];
To further illustrate the difference, say we read a byte at offset 12345 from the memory-mapped file:
char value = buffer[ 12345 ];
Now the virtual memory manager won't read all the data up to this offset; instead it maps only the page that contains that offset into memory. That would be the page covering offsets 12288 (= 4096*3) up to 16384 (= 4096*4).
The first one reads from the file into a buffer, the buffer is independent of the original file.
The second one accesses the file itself: you won't be able to delete the file while a mapping exists, and although you can't make changes yourself (the mapping is read-only), changes made to the file outside your program can become visible in your mapping.

Using WriteFile to fill up a cluster

I want to use WriteFile to fill up the end of every file until it reaches the end of its last cluster. Then I want to delete what I wrote and repeat the process (attempting to get rid of data that might have been there).
I have two issues:
WriteFile gives me an error: ERROR_INVALID_PARAMETER
Depending on the type of file, WriteFile() gives me different results
For the first issue, I realized that the parameter nNumberOfBytesToWrite in WriteFile() has to be a multiple of the bytes per sector (512 bytes in my case). Is this a limitation of the function, or am I doing something wrong?
For the second issue, I'm using two dummy files (.txt and .html) on an external hard drive to write random data to. In the case of the .txt file, the data is written to the end of the file, which is what I need. However, the .html file just gets the data written at the beginning, replacing whatever was already there.
Here are some code snippets relevant to my issue:
hFile = CreateFile(result,
                   GENERIC_READ | GENERIC_WRITE | FILE_READ_ATTRIBUTES,
                   FILE_SHARE_READ | FILE_SHARE_WRITE,
                   0,
                   OPEN_EXISTING,
                   FILE_FLAG_NO_BUFFERING,
                   0);
if (hFile == INVALID_HANDLE_VALUE) {
    cout << "File does not exist" << endl;
    CloseHandle(hFile);
}
DWORD dwBytesWritten;
char * wfileBuff = new char[512];
memset(wfileBuff, '0', 512);
returnz = SetFilePointer(hFile, 0, NULL, FILE_END);
if (returnz == 0) {
    cout << "Error: " << GetLastError() << endl;
}
LockFile(hFile, returnz, 0, 512, 0);
returnz = WriteFile(hFile, wfileBuff, 512, &dwBytesWritten, NULL);
if (returnz == 0) {
    cout << "Error: " << GetLastError() << endl;
}
UnlockFile(hFile, returnz, 0, 512, 0);
cout << dwBytesWritten << endl << endl;
I am using static numbers at the moment just to test out the functions. Is there any way I can always write to the end of the file, no matter what type of file it is? I also tried SetFilePointer(hFile, 0, (fileSize - slackSpace + 1), FILE_BEGIN); but that didn't work.
You need to heed the information in the documentation concerning FILE_FLAG_NO_BUFFERING. Specifically this section:
As previously discussed, an application must meet certain requirements when working with files opened with FILE_FLAG_NO_BUFFERING. The following specifics apply:
File access sizes, including the optional file offset in the OVERLAPPED structure, if specified, must be for a number of bytes that is an integer multiple of the volume sector size. For example, if the sector size is 512 bytes, an application can request reads and writes of 512, 1,024, 1,536, or 2,048 bytes, but not of 335, 981, or 7,171 bytes.
File access buffer addresses for read and write operations should be physical sector-aligned, which means aligned on addresses in memory that are integer multiples of the volume's physical sector size. Depending on the disk, this requirement may not be enforced.
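In other words, the buffer address, the transfer size, and the file offset all have to be sector multiples. Below is a rough, untested sketch of a write that satisfies those constraints; it assumes a 512-byte sector size and an hFile opened as in your snippet. Note that to preserve the real data in the last partial sector you would first have to read that sector, fill in only the slack bytes, and write the whole sector back.
#include <windows.h>
#include <cstring>
#include <iostream>
using std::cout; using std::endl;

// Sketch only: writes one sector of filler at a sector-aligned offset, from a
// sector-aligned buffer, as FILE_FLAG_NO_BUFFERING requires. Real code should
// query the actual sector size instead of assuming 512.
bool WriteAlignedFiller(HANDLE hFile)
{
    const DWORD sectorSize = 512;

    // VirtualAlloc returns page-aligned (4096-byte) memory, which more than
    // satisfies the sector-alignment requirement for the buffer address.
    char *buf = static_cast<char*>(
        VirtualAlloc(NULL, sectorSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
    if (buf == NULL)
        return false;
    memset(buf, '0', sectorSize);

    // The file offset must also be a sector multiple, so round the current
    // end of file down to a sector boundary. (See the caveat above about
    // preserving the existing bytes in that last partial sector.)
    LARGE_INTEGER size, pos;
    GetFileSizeEx(hFile, &size);
    pos.QuadPart = (size.QuadPart / sectorSize) * sectorSize;
    SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);

    DWORD dwBytesWritten = 0;
    BOOL ok = WriteFile(hFile, buf, sectorSize, &dwBytesWritten, NULL);
    if (!ok)
        cout << "Error: " << GetLastError() << endl;

    VirtualFree(buf, 0, MEM_RELEASE);
    return ok != FALSE;
}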

Reading data off of a cluster

I need help reading data off of the last cluster of a file using CreateFile() and then using ReadFile(). First I'm stuck with a zero result for my ReadFile() because I think I have incorrect permissions set up in CreateFile().
/**********CreateFile for volume ********/
HANDLE hDevice = INVALID_HANDLE_VALUE;
hDevice = CreateFile(L"\\\\.\\C:",
                     0,
                     FILE_SHARE_READ | FILE_SHARE_WRITE,
                     NULL,
                     OPEN_EXISTING,
                     0,
                     NULL);
if (hDevice == INVALID_HANDLE_VALUE)
{
    wcout << "error at hDevice at CreateFile " << endl;
    system("pause");
}
/******* Read file from the volume *********/
DWORD nRead;
TCHAR buff[4096];
BOOL fileFromVol = ReadFile(hDevice, buff, 4096, &nRead, NULL);
if (fileFromVol == 0)
{
    cout << "Error with fileFromVol" << "\n\n";
    system("pause");
}
Next, I have all the cluster information and file information I need (file size, last cluster location of the file, number of clusters on disk, cluster size, etc.). How do I set the pointer on the volume to start at a specified cluster location so I can read/write data from it?
The main problem is that you specify 0 for dwDesiredAccess. In order to read the data you should specify FILE_READ_DATA.
On top of that I seriously question the use of TCHAR. That's appropriate for text when you need to support Windows 9x. On top of not needing to support Windows 9x, the data is not text. Your buffer should be of type unsigned char.
Obviously you need the buffer to be a multiple of the cluster size. You've hard coded 4096, but the real code should surely query the cluster size.
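For instance, here is a small, untested sketch of querying the cluster size with GetDiskFreeSpace (the drive letter is a placeholder for whatever volume you opened):
DWORD sectorsPerCluster = 0, bytesPerSector = 0, freeClusters = 0, totalClusters = 0;
if (GetDiskFreeSpaceW(L"C:\\", &sectorsPerCluster, &bytesPerSector,
                      &freeClusters, &totalClusters))
{
    DWORD clusterSize = sectorsPerCluster * bytesPerSector;
    // size the read buffer as a multiple of clusterSize
}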
When either of these API calls fails, it indicates the failure reason in the thread's last-error value, which you can obtain by calling GetLastError. When your ReadFile fails here, GetLastError will return ERROR_ACCESS_DENIED.
You can seek in the volume by calling SetFilePointerEx. Again, you will need to seek to multiples of the cluster size.
LARGE_INTEGER dist;
dist.QuadPart = ClusterNum * ClusterSize;
BOOL res = SetFilePointerEx(hFile, dist, nullptr, FILE_BEGIN);
if (!res)
// handle error
If you are reading sequentially then there's no need to set the file pointer; the call to ReadFile will advance it automatically.
When doing random-access I/O, just don't mess with the file pointer stored in the file handle at all. Instead, use an OVERLAPPED structure and specify the location for each and every I/O operation.
This works even for synchronous I/O (if the file is opened without FILE_FLAG_OVERLAPPED).
Of course, as David mentioned you will get ERROR_ACCESS_DENIED if you perform operations using a file handle opened without sufficient access.
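A rough, untested sketch of that pattern, reusing the hDevice and buff from your snippet (clusterSize and clusterNum are placeholders for the values you already computed):
// Read one cluster-sized block at an absolute byte offset without touching the
// handle's file pointer. With a handle opened WITHOUT FILE_FLAG_OVERLAPPED,
// ReadFile still completes synchronously; the OVERLAPPED fields just supply
// the position.
unsigned long long offset = (unsigned long long)clusterNum * clusterSize;
OVERLAPPED ov = {};
ov.Offset     = (DWORD)(offset & 0xFFFFFFFF);
ov.OffsetHigh = (DWORD)(offset >> 32);
DWORD bytesRead = 0;
if (!ReadFile(hDevice, buff, clusterSize, &bytesRead, &ov))
{
    cout << "ReadFile failed: " << GetLastError() << "\n";
}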

Faster way to get File Size information C++

I have a function to get the size of a file. I am running this on WinCE. Here is my current code, which seems particularly slow:
int Directory::GetFileSize(const std::string &filepath)
{
    int filesize = -1;
#ifdef linux
    struct stat fileStats;
    if (stat(filepath.c_str(), &fileStats) != -1)
        filesize = fileStats.st_size;
#else
    std::wstring widePath;
    Unicode::AnsiToUnicode(widePath, filepath);
    HANDLE hFile = CreateFile(widePath.c_str(), 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                              0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (hFile != INVALID_HANDLE_VALUE)   // CreateFile returns INVALID_HANDLE_VALUE, not 0, on failure
    {
        filesize = ::GetFileSize(hFile, NULL);
        CloseHandle(hFile);
    }
#endif
    return filesize;
}
At least for Windows, I think I'd use something like this:
__int64 Directory::GetFileSize(std::wstring const &path) {
    WIN32_FIND_DATAW data;
    HANDLE h = FindFirstFileW(path.c_str(), &data);
    if (h == INVALID_HANDLE_VALUE)
        return -1;
    FindClose(h);
    return data.nFileSizeLow | (__int64)data.nFileSizeHigh << 32;
}
If the compiler you're using supports it, you might want to use long long instead of __int64. You probably do not want to use int though, as that will only work correctly for files up to 2 gigabytes, and files larger than that are now pretty common (though perhaps not so common on a WinCE device).
I'd expect this to be faster than most other methods though. It doesn't require opening the file itself at all, just finding the file's directory entry (or, in the case of something like NTFS, its master file table entry).
Your solution is already rather fast at querying the size of a file.
Under Windows, at least for NTFS and FAT, the file system driver keeps the file size in its cache, so querying it is rather quick. Most of the time involved goes into switching from user mode to kernel mode, rather than into the file system driver's processing.
If you want to make it even faster, you would have to maintain your own cache in user mode, e.g. a special hash table, to avoid the switch from user mode to kernel mode. But I don't recommend doing that, because you will gain very little performance.
PS: You'd better avoid the statement Unicode::AnsiToUnicode(widePath, filepath); in your function body. This conversion is rather time-consuming.
Just an idea (I haven't tested it), but I would expect GetFileAttributesEx to be fastest at the system level. It avoids having to open the file, and logically I would expect it to be faster than FindFirstFile, since it doesn't have to maintain any information for continuing the search.
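A minimal, untested sketch of that approach (the function name is mine; a wide-character path is assumed):
#include <windows.h>

// Returns the file size in bytes, or -1 on failure, using only the directory
// metadata; no handle to the file itself is opened.
__int64 GetFileSizeAttr(const wchar_t *path)
{
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if (!GetFileAttributesExW(path, GetFileExInfoStandard, &fad))
        return -1;
    return ((__int64)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
}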
You could roll your own but I don't see why your approach is slow:
int Get_Size( string path )
{
    // #include <cstdio>  (fopen_s / fseek / ftell / fclose)
    FILE *pFile = NULL;
    // open the file stream in binary mode
    fopen_s( &pFile, path.c_str(), "rb" );
    if ( pFile == NULL )
        return -1;                  // file could not be opened
    // move the file pointer to the end of the file
    fseek( pFile, 0, SEEK_END );
    // the current position is the file size
    // (note: ftell returns a long, so this only works for files up to 2GB)
    int Size = ftell( pFile );
    // return the file pointer to the beginning if you want to read the file
    // rewind( pFile );
    // close the stream and release the buffer
    fclose( pFile );
    return Size;
}