reliability of proc/statm for finding memory leak

I am trying to find a slow memory leak in a large application.
ps shows the VSZ growing slowly until the application crashes after running for 12-18 hours. Unfortunately, valgrind, leakcheck, etc. have not been useful (Valgrind fails with an illegal instruction).
Alternatively, I've been printing the contents of /proc/<pid>/statm over time, and approximately every 10s I see the first field of statm (total program size) increase by 20-30 units (statm counts pages, not bytes).
I've tracked it down to one function but it doesn't make sense. The offending function reads a directory and performs a clear() on a std::set. What in the function would increase the memory footprint? And... why doesn't the memory reduce once the directory is closed?
Trace Output:
DvCfgProfileList::obtainSystemProfileList() PRE MEMORY USAGE: 27260 11440 7317 15 0 12977 0
DvCfgProfileList::obtainSystemProfileList() MID 1 MEMORY USAGE: 27296 11440 7317 15 0 13013 0
DvCfgProfileList::obtainSystemProfileList() MID 2 MEMORY USAGE: 27296 11443 7317 15 0 13013 0
DvCfgProfileList::obtainSystemProfileList POST MEMORY USAGE: 27288 11443 7317 15 0 13005 0
The Big Question
Can I rely on reading /proc/statm for an immediate reading of process memory? This Unix/Linux Posting says it is "updated on every access".
If true, then why does it indicate that obtainSystemProfileList() is leaking?
EDIT I
I added the link to the Unix/Linux post. So if reads of /proc/.../statm result in a direct and immediate kernel call, then is there some time delay in the kernel updating its own internal results? If indeed there is no memory leak in the code fragment, then what else explains the change in mem values across a few lines of code?
EDIT II
Would calling getrusage() provide a more immediate and accurate view of process memory use, or does it just make the same, potentially delayed, kernel calls as reading /proc/.../statm?
Kernel is 32-bit 3.10.80-1 if that makes any difference...
Code Fragment:
bool
DvCfgProfileList::obtainSystemProfileList()
{
    TRACE(("DvCfgProfileList::obtainSystemProfileList() PRE "));
    DvComUtil::printMemoryUsage();
    DIR *pDir = opendir(SYSTEM_PROFILE_DIRECTORY);
    if (pDir == 0)
    {
        mkdir(SYSTEM_PROFILE_DIRECTORY, S_IRWXU | S_IRWXG | S_IRWXO);
        pDir = opendir(SYSTEM_PROFILE_DIRECTORY);
        if (pDir == 0)
        {
            TRACE(("%s does not exist or cannot be created\n", SYSTEM_PROFILE_DIRECTORY));
            return false;
        }
    }
    TRACE(("DvCfgProfileList::obtainSystemProfileList() MID 1 "));
    DvComUtil::printMemoryUsage();
    mpProfileList->clearSystemProfileList(); // calls (std::set) mProfileList.clear()
    TRACE(("DvCfgProfileList::obtainSystemProfileList() MID 2 "));
    DvComUtil::printMemoryUsage();
    struct dirent *pEntry;
    while ((pEntry = readdir(pDir)) != 0)
    {
        if (!strcmp(pEntry->d_name, ".") || !strcmp(pEntry->d_name, ".."))
            continue;
        TRACE(("Profile name = %s\n", pEntry->d_name));
        mpProfileList->addSystemProfile(std::string(pEntry->d_name));
    }
    closedir(pDir);
    printf("DvCfgProfileList::obtainSystemProfileList POST ");
    DvComUtil::printMemoryUsage();
    return true;
}
/* static */ void
DvComUtil::printMemoryUsage()
{
    char fname[256], line[256];
    snprintf(fname, sizeof(fname), "/proc/%d/statm", getpid());
    FILE *pFile = fopen(fname, "r");
    if (!pFile)
        return;
    // print only if a line was actually read; otherwise line is uninitialized
    if (fgets(line, sizeof(line), pFile))
        printf("MEMORY USAGE: %s", line);
    fclose(pFile);
}
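For reference when reading the trace output: per proc(5), the seven statm fields are size, resident, shared, text, lib, data and dt, and all of them are counted in pages. Below is a minimal sketch of turning one statm line into numbers that can be compared between samples. `StatmSample`, `parseStatm` and `pagesToBytes` are illustrative names that are not part of the code in the question, and the page size is a parameter (on a real system it would come from sysconf(_SC_PAGESIZE)).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One sample of /proc/<pid>/statm. Every field is a page count (see proc(5)).
struct StatmSample {
    unsigned long size = 0;      // total program size
    unsigned long resident = 0;  // resident set size
    unsigned long shared = 0;    // resident shared pages
    unsigned long text = 0;      // code
    unsigned long lib = 0;       // unused on modern kernels
    unsigned long data = 0;      // data + stack
    unsigned long dt = 0;        // unused on modern kernels
};

// Parses a statm line such as "27260 11440 7317 15 0 12977 0".
// Returns false if fewer than seven numeric fields are present.
bool parseStatm(const std::string& line, StatmSample& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.size >> out.resident >> out.shared
                                >> out.text >> out.lib >> out.data >> out.dt);
}

// Converts a page count to bytes for a given page size.
unsigned long long pagesToBytes(unsigned long pages, unsigned long pageSize) {
    return static_cast<unsigned long long>(pages) * pageSize;
}
```

Applied to the PRE and MID 1 lines of the trace, the size field grows by 36 pages, which would be 147456 bytes if the page size is 4096 bytes (an assumption; the actual page size of the target was not stated).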

Related

VSZ and RSS kept increasing on AIX

I see an abnormal memory usage pattern while running my application program on AIX...
I have created a simple program that calls malloc and free in a loop to replicate the same problem.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int *ptr_one;
    // Enter 0 as the value.
    // Waiting for input gives me a few seconds to fetch the PID of this standalone process
    // and run 'ps -p <PID> -o "vsz rssize"'
    long a;
    scanf("%ld", &a);
    for(;;)
    {
        if(a < 10000000) a = a + 100;
        ptr_one = (int *)malloc(sizeof(int)*a);
        if (ptr_one == 0){
            printf("ERROR: Out of memory\n");
            return 1;
        }
        *ptr_one = 25;
        printf("%d\n", *ptr_one);
        free(ptr_one);
    }
    return 0;
}
I have captured the memory usage of this program using the below command,
ps -p $1 -o "vsz rssize" | tail -1 >> out.txt
The graph shows that the memory kept growing and was never released.
Is this a sign of a leak, or is this normal memory behavior on AIX?

It is entirely normal that the process's memory usage does not decrease: malloc can request additional memory from the system for the process, but free typically does not return it. Instead, freed memory is kept by the allocator and reused by future malloc calls.
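One way to tell a real leak from allocator retention is to count live bytes inside the program instead of watching VSZ. The following is only a sketch under that assumption: `trackedMalloc`, `trackedFree`, `liveBytes` and `runAllocFreeLoop` are names invented here, and the counter is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Bytes currently allocated through the wrappers below (not thread-safe).
static std::size_t g_liveBytes = 0;

// Header stored in front of each block so trackedFree knows its size.
// The max_align_t member keeps the user pointer suitably aligned.
union BlockHeader {
    std::size_t size;
    std::max_align_t align;
};

void* trackedMalloc(std::size_t size) {
    BlockHeader* h = static_cast<BlockHeader*>(std::malloc(sizeof(BlockHeader) + size));
    if (h == nullptr) return nullptr;
    h->size = size;
    g_liveBytes += size;
    return h + 1;  // user data starts right after the header
}

void trackedFree(void* p) {
    if (p == nullptr) return;
    BlockHeader* h = static_cast<BlockHeader*>(p) - 1;
    g_liveBytes -= h->size;
    std::free(h);
}

std::size_t liveBytes() { return g_liveBytes; }

// Mirrors the malloc/free loop from the question for a fixed number of
// iterations and returns the bytes still live afterwards (0 means no leak).
std::size_t runAllocFreeLoop(int iterations) {
    long a = 0;
    for (int i = 0; i < iterations; ++i) {
        if (a < 10000000) a += 100;
        int* p = static_cast<int*>(trackedMalloc(sizeof(int) * a));
        if (p == nullptr) break;
        *p = 25;
        trackedFree(p);
    }
    return liveBytes();
}
```

If the live-byte count returns to zero while VSZ stays high, the growth comes from memory the allocator is holding for reuse rather than from blocks the program forgot to free.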

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
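To make the padding idea concrete, here is a minimal single-threaded sketch. It is not my actual checker and not the code from the linked answer: `guardedAlloc`, `guardedCheck`, `guardedFree`, the 16-byte pad and the 0xA5 fill value are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

namespace {
const std::size_t kPad = 16;       // guard bytes on each side (assumed value)
const unsigned char kFill = 0xA5;  // known pattern written into the guards
}

// Block layout: [GuardHeader][front pad][user data][back pad]
union GuardHeader {
    std::size_t size;        // user-visible size of the block
    std::max_align_t align;  // keeps the user pointer suitably aligned
};

void* guardedAlloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(GuardHeader) + kPad + size + kPad));
    if (raw == nullptr) return nullptr;
    reinterpret_cast<GuardHeader*>(raw)->size = size;
    unsigned char* user = raw + sizeof(GuardHeader) + kPad;
    std::memset(user - kPad, kFill, kPad);  // front guard
    std::memset(user + size, kFill, kPad);  // back guard
    return user;
}

// Returns true if both guards still hold the fill pattern.
bool guardedCheck(const void* p) {
    const unsigned char* user = static_cast<const unsigned char*>(p);
    const unsigned char* raw = user - kPad - sizeof(GuardHeader);
    const std::size_t size = reinterpret_cast<const GuardHeader*>(raw)->size;
    const unsigned char* front = user - kPad;
    const unsigned char* back = user + size;
    for (std::size_t i = 0; i < kPad; ++i) {
        if (front[i] != kFill || back[i] != kFill) return false;
    }
    return true;
}

// Verifies the guards one last time, then releases the whole block.
void guardedFree(void* p) {
    if (p == nullptr) return;
    assert(guardedCheck(p) && "out-of-bounds write detected");
    std::free(static_cast<unsigned char*>(p) - kPad - sizeof(GuardHeader));
}
```

A write that lands in either pad changes the pattern, so the next guardedCheck on that block reports it; writes that jump past the pad entirely go unnoticed, which is the limitation mentioned above.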
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.
void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;
    // manipulate linked list here
    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they do not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses, which I won't list here to keep this post from getting any longer.
Thanks in advance for any help!

Thanks to all commenters. I tried nearly all of the suggestions with no results, until I finally decided to write a simple memory allocation stress test - one that would run a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);
    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;
    int num_allocs = 0, num_frees = 0;
    while(1)
    {
        int action;
        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }
        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;
            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);
            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);
            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);
            num_frees++;
        }
        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }
        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}
int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }
    while(1) {
        sleep(10);
        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);
        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}

HeapWalk not working as expected in Release mode

So I used this example of the HeapWalk function to implement it into my app. I played around with it a bit and saw that when I added
HANDLE d = HeapAlloc(hHeap, 0, sizeof(int));
int* f = new(d) int;
after creating the heap then some new output would be logged:
Allocated block Data portion begins at: 0X037307E0
Size: 4 bytes
Overhead: 28 bytes
Region index: 0
Seeing this, I thought I could check Entry.wFlags to see if it was set to PROCESS_HEAP_ENTRY_BUSY, to keep track of how much allocated memory I'm using on the heap. So I have:
HeapLock(heap);
int totalUsedSpace = 0, totalSize = 0, largestFreeSpace = 0, largestCounter = 0;
PROCESS_HEAP_ENTRY entry;
entry.lpData = NULL;
while (HeapWalk(heap, &entry) != FALSE)
{
    int entrySize = entry.cbData + entry.cbOverhead;
    if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0)
    {
        // We have allocated memory in this block
        totalUsedSpace += entrySize;
        largestCounter = 0;
    }
    else
    {
        // We do not have allocated memory in this block
        largestCounter += entrySize;
        if (largestCounter > largestFreeSpace)
        {
            // Save this value as we've found a bigger space
            largestFreeSpace = largestCounter;
        }
    }
    // Keep a track of the total size of this heap
    totalSize += entrySize;
}
HeapUnlock(heap);
And this appears to work when built in debug mode (totalSize and totalUsedSpace are different values). However, when I run it in Release mode totalUsedSpace is always 0.
I stepped through it with the debugger while in Release mode and for each heap it loops three times and I get the following flags in entry.wFlags from calling HeapWalk:
1 (PROCESS_HEAP_REGION)
0
2 (PROCESS_HEAP_UNCOMMITTED_RANGE)
It then exits the while loop and GetLastError() returns ERROR_NO_MORE_ITEMS as expected.
From here I found that a flag value of 0 is "the committed block which is free, i.e. not being allocated or not being used as control structure."
Does anyone know why it does not work as intended when built in Release mode? I don't have much experience of how memory is handled by the computer, so I'm not sure where the error might be coming from. Searching on Google didn't come up with anything so hopefully someone here knows.
UPDATE: I'm still looking into this myself, and if I monitor the app using vmmap I can see that the process has 9 heaps, but GetProcessHeaps reports that there are 22 heaps. Also, none of the heap handles it returns match the return value of GetProcessHeap() or _get_heap_handle(). It seems like GetProcessHeaps is not behaving as expected. Here is the code to get the list of heaps:
// Count how many heaps there are and allocate enough space for them
DWORD numHeaps = GetProcessHeaps(0, NULL);
HANDLE* handles = new HANDLE[numHeaps];
// Get a handle to known heaps for us to compare against
HANDLE defaultHeap = GetProcessHeap();
HANDLE crtHeap = (HANDLE)_get_heap_handle();
// Get a list of handles to all the heaps
DWORD retVal = GetProcessHeaps(numHeaps, handles);
And retVal is the same value as numHeaps, which indicates that there was no error.

Application Verifier had previously been set up to do full page heap verification of my executable, and it was interfering with the heaps returned by GetProcessHeaps. I'd forgotten that it was set up: it was done for a different issue several days earlier and then closed without clearing the tests. It wasn't happening in the debug build because the application builds to a different file name for debug builds.
We managed to detect this by adding a breakpoint and looking at the callstack of the thread. We could see the AV DLL had been injected in and that let us know where to look.

Why this app doesn't consume as much memory as expected

I wrote a simple application to test memory consumption. In this test application, I created four processes that continually consume memory; they do not release the memory until the process exits.
I expected this test application to consume most of the RAM and cause other applications to slow down or crash. But the result is not what I expected. Below is the code:
#include <stdio.h>
#include <unistd.h>
#include <list>
#include <vector>
using namespace std;
unsigned short calcrc(unsigned char *ptr, int count)
{
    unsigned short crc;
    unsigned char i;
    //high cpu-consumption code
    //implements the CRC algorithm
    //CRC is Cyclic Redundancy Code
}
void* ForkChild(void* param){
    vector<unsigned char*> MemoryVector;
    pid_t PID = fork();
    if (PID > 0){
        const int TEN_MEGA = 10 * 10 * 1024 * 1024; // note: this is 100 MiB, despite the name
        unsigned char* buffer = NULL;
        while(1){
            buffer = NULL;
            buffer = new unsigned char [TEN_MEGA];
            if (buffer){
                try{
                    calcrc(buffer, TEN_MEGA);
                    MemoryVector.push_back(buffer);
                } catch(...){
                    printf("An error was throwed, but caught by our app!\n");
                    delete [] buffer;
                    buffer = NULL;
                }
            }
            else{
                printf("no memory to allocate!\n");
                try{
                    if (MemoryVector.size()){
                        buffer = MemoryVector[0];
                        calcrc(buffer, TEN_MEGA);
                        buffer = NULL;
                    } else {
                        printf("no memory ever allocated for this Process!\n");
                        continue;
                    }
                } catch(...){
                    printf("An error was throwed -- branch 2,"
                           "but caught by our app!\n");
                    buffer = NULL;
                }
            }
        } //while(1)
    } else if (PID == 0){
    } else {
        perror("fork error");
    }
    return NULL;
}
int main(){
    int children = 4;
    while(--children >= 0){
        ForkChild(NULL);
    }
    while(1) sleep(1);
    printf("exiting main process\n");
    return 0;
}
TOP command
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2775 steve 20 0 1503m 508 312 R 99.5 0.0 1:00.46 test
2777 steve 20 0 1503m 508 312 R 96.9 0.0 1:00.54 test
2774 steve 20 0 1503m 904 708 R 96.6 0.0 0:59.92 test
2776 steve 20 0 1503m 508 312 R 96.2 0.0 1:00.57 test
CPU usage is high, but the memory percentage remains 0.0. How is that possible?
Free command
free shared buffers cached
Mem: 3083796 0 55996 428296
Free memory is more than 3G out of 4G RAM.
Does anybody know why this test app doesn't work as expected?

Linux uses optimistic memory allocation: it will not physically allocate a page of memory until that page is actually written to. For that reason, you can allocate much more memory than what is available, without increasing memory consumption by the system.
If you want to force the system to allocate (commit) a physical page, then you have to write to it.
The following line does not issue any write, as it is default-initialization of unsigned char, which is a no-op:
buffer = new unsigned char [TEN_MEGA];
If you want to force a commit, use zero-initialization:
buffer = new unsigned char [TEN_MEGA]();
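A minimal sketch of both ways to make sure the buffer is written to. `allocateZeroed` and `touchPages` are illustrative names, and the page size is passed in as a parameter (4096 is a common value but an assumption here; it must be greater than zero).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Value-initialization: the trailing () zero-fills the array, so every byte
// of the new buffer is written at allocation time.
std::unique_ptr<unsigned char[]> allocateZeroed(std::size_t bytes) {
    return std::unique_ptr<unsigned char[]>(new unsigned char[bytes]());
}

// Alternative for a buffer that already exists: write one byte every
// pageSize bytes. Returns the number of writes performed.
std::size_t touchPages(unsigned char* buffer, std::size_t bytes, std::size_t pageSize) {
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < bytes; offset += pageSize) {
        buffer[offset] = 1;
        ++touched;
    }
    return touched;
}
```

Either helper could replace the plain new[] in the question; with lazy allocation it is the write, not the allocation call, that makes the resident size grow.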

To make the comments into an answer:
Linux will not allocate memory pages for a process until it writes to them (copy-on-write).
Additionally, you are not writing to your buffer anywhere, as the default constructor for unsigned char does not perform any initializations, and new[] default-initializes all items.

fork() returns the PID in the parent, and 0 in the child. Your ForkChild as written will execute all the work in the parent, not the child.
And the standard new operator will never return null; it will throw if it fails to allocate memory (but due to overcommit it won't actually do that either in Linux). This means your test of buffer after the allocation is meaningless: it will always either take the first branch or never reach the test. If you want a null return, you need to write new (std::nothrow) .... Include <new> for that to work.
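Both points can be sketched together. This is an illustrative arrangement only: `tryAllocate`, `runInChild` and `allocateAndTouch` are hypothetical helpers, and the example assumes a POSIX system for fork() and waitpid().

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns nullptr on failure instead of throwing std::bad_alloc.
unsigned char* tryAllocate(std::size_t bytes) {
    return new (std::nothrow) unsigned char[bytes];
}

// Runs work() in a child process and returns the child's exit status to the
// parent, or -1 if fork() or waitpid() fails or the child exits abnormally.
int runInChild(int (*work)()) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;      // fork failed
    }
    if (pid == 0) {
        _exit(work());  // child: fork() returned 0, so do the work here
    }
    int status = 0;     // parent: pid holds the child's PID
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}

// Example workload: allocate 1 MiB, write to every byte, report 0 on success.
int allocateAndTouch() {
    const std::size_t bytes = 1024 * 1024;
    unsigned char* buffer = tryAllocate(bytes);
    if (buffer == nullptr) return 1;  // this test is meaningful with nothrow
    for (std::size_t i = 0; i < bytes; ++i) buffer[i] = 0x5A;
    delete[] buffer;
    return 0;
}
```

Calling runInChild(allocateAndTouch) from the parent keeps the memory-hungry work in the child, which is what the branch on PID == 0 was presumably meant to do.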

But your program is in fact doing what you expected it to do. As another answer has pointed out (@Michael Foukarakis's answer), memory that is not used is not allocated. In the output of top, I noticed that the VIRT column showed a large amount of memory for each process running your program. A little googling later, I found out what this is:
VIRT -- Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
So, as you can see, your program does in fact reserve memory for itself, but only as virtual memory pages that have been mapped and not yet used. I think that is a smart thing for the system to do.
A snippet from this wiki page
A page, memory page, or virtual page -- a fixed-length contiguous block of virtual memory, and it is the smallest unit of data for the following:
memory allocation performed by the operating system for a program; and
transfer between main memory and any other auxiliary store, such as a hard disk drive.
...Thus a program can address more (virtual) RAM than physically exists in the computer. Virtual memory is a scheme that gives users the illusion of working with a large block of contiguous memory space (perhaps even larger than real memory), when in actuality most of their work is on auxiliary storage (disk). Fixed-size blocks (pages) or variable-size blocks of the job are read into main memory as needed.
Sources:
http://www.computerhope.com/unix/top.htm
https://stackoverflow.com/a/18917909/2089675
http://en.wikipedia.org/wiki/Page_(computer_memory)

If you want to gobble up a lot of memory:
int mb = 0;
char* buffer;
while (1) {
    buffer = (char*)malloc(1024*1024);
    if (buffer == NULL) break; // stop once allocation fails
    memset(buffer, 0, 1024*1024);
    mb++;
}
I used something like this to make sure the file buffer cache was empty when taking some file I/O timing measurements.
As other answers have already mentioned, your code doesn't ever write to the buffer after allocating it. Here memset is used to write to the buffer.

Designing a fast "rolling window" file reader

I'm writing an algorithm in C++ that scans a file with a "sliding window," meaning it will scan bytes 0 to n, do something, then scan bytes 1 to n+1, do something, and so forth, until the end is reached.
My first algorithm was to read the first n bytes, do something, dump one byte, read a new byte, and repeat. This was very slow because to "ReadFile" from HDD one byte at a time was inefficient. (About 100kB/s)
My second algorithm involves reading a chunk of the file (perhaps n*1000 bytes, meaning the whole file if it's not too large) into a buffer and reading individual bytes off the buffer. Now I get about 10MB/s (decent SSD + Core i5, 1.6GHz laptop).
My question: Do you have suggestions for even faster models?
edit: My big buffer (relative to the window size) is implemented as follows:
- for a rolling window of 5kB, the buffer is initialized to 5MB
- read the first 5MB of the file into the buffer
- the window pointer starts at the beginning of the buffer
- upon shifting, the window pointer is incremented
- when the window pointer nears the end of the 5MB buffer, (say at 4.99MB), copy the remaining 0.01MB to the beginning of the buffer, reset the window pointer to the beginning, and read an additional 4.99MB into the buffer.
- repeat
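The steps above can be sketched roughly as follows. This is not my actual implementation (which I removed), just a minimal version of the same scheme; `SlidingWindowReader` and `collectWindows` are made-up names, it reads from any std::istream, and it assumes a window size of at least 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

class SlidingWindowReader {
public:
    // bufferSize is raised to windowSize if it is smaller.
    SlidingWindowReader(std::istream& in, std::size_t windowSize, std::size_t bufferSize)
        : in_(in), window_(windowSize),
          buf_(bufferSize < windowSize ? windowSize : bufferSize),
          pos_(0), filled_(0) {
        refill();
    }

    // Returns a pointer to the next window of windowSize bytes, shifted by one
    // byte per call, or nullptr once fewer than windowSize bytes remain.
    // The pointer is only valid until the next call.
    const char* next() {
        if (pos_ + window_ > filled_) {
            // Window would run past the data: move the unread tail to the
            // front of the buffer, reset the window pointer, read more.
            const std::size_t tail = filled_ - pos_;
            std::memmove(buf_.data(), buf_.data() + pos_, tail);
            pos_ = 0;
            filled_ = tail;
            refill();
            if (window_ > filled_) return nullptr;
        }
        return buf_.data() + pos_++;
    }

private:
    // Reads as many bytes as fit into the free space at the end of the buffer.
    void refill() {
        in_.read(buf_.data() + filled_,
                 static_cast<std::streamsize>(buf_.size() - filled_));
        filled_ += static_cast<std::size_t>(in_.gcount());
    }

    std::istream& in_;
    std::size_t window_;
    std::vector<char> buf_;
    std::size_t pos_;
    std::size_t filled_;
};

// Small driver for experiments: returns every window as a string.
std::vector<std::string> collectWindows(const std::string& data,
                                        std::size_t windowSize,
                                        std::size_t bufferSize) {
    std::istringstream in(data);
    SlidingWindowReader reader(in, windowSize, bufferSize);
    std::vector<std::string> windows;
    while (const char* w = reader.next()) {
        windows.push_back(std::string(w, windowSize));
    }
    return windows;
}
```

For a real file, the stream would be a std::ifstream opened in binary mode, with the window and buffer sizes from the list above (5kB and 5MB).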
edit 2 - the actual implementation (removed)
Thank you all for the many insightful responses. It was hard to select a "best answer"; they were all excellent and helped with my coding.

I use a sliding window in one of my apps (actually, several layers of sliding windows working on top of each other, but that is outside the scope of this discussion). The window uses a memory-mapped file view via CreateFileMapping() and MapViewOfFile(), and I have an abstraction layer on top of that. I ask the abstraction layer for any range of bytes I need, and it ensures that the file mapping and file view are adjusted accordingly so those bytes are in memory. Every time a new range of bytes is requested, the file view is adjusted only if needed.
The file view is positioned and sized on page boundaries that are even multiples of the system granularity as reported by GetSystemInfo(). Just because a scan reaches the end of a given byte range does not necessarily mean it has reached the end of a page boundary yet, so the next scan may not need to alter the file view at all, the next bytes are already in memory. If the first requested byte of a range exceeds the right-hand boundary of a mapped page, the left edge of the file view is adjusted to the left-hand boundary of the requested page and any pages to the left are unmapped. If the last requested byte in the range exceeds the right-hand boundary of the right-most mapped page, a new page is mapped and added to the file view.
It sounds more complex than it really is to implement once you get into the coding of it:
Creating a View Within a File
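The boundary arithmetic described above can be sketched without any Windows calls. In this sketch `ViewRange` and `alignView` are illustrative names, and the allocation granularity and page size are plain parameters; on Windows they would come from GetSystemInfo(). The page size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

struct ViewRange {
    std::uint64_t offset;  // where the view starts in the file
    std::uint64_t size;    // how many bytes the view covers
};

// Computes a view that contains [offset, offset + size), starting on an
// allocation-granularity boundary and sized in whole pages, truncated at EOF.
ViewRange alignView(std::uint64_t offset, std::uint64_t size,
                    std::uint64_t allocGran, std::uint64_t pageSize,
                    std::uint64_t fileSize) {
    ViewRange r;
    // round the start down to the nearest allocation boundary
    r.offset = (offset / allocGran) * allocGran;
    // round the covered span up to the next page boundary
    const std::uint64_t span = (offset - r.offset) + size;
    r.size = (span + pageSize - 1) & ~(pageSize - 1);
    // a view must not extend past the end of the file
    if (r.offset + r.size > fileSize) {
        r.size = fileSize > r.offset ? fileSize - r.offset : 0;
    }
    return r;
}
```

For example, with a 65536-byte granularity and 4096-byte pages, a request for 100 bytes at offset 70000 in a 1 MiB file yields a view at offset 65536 of size 8192.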
It sounds like you are scanning bytes in fixed-sized blocks, so this approach is very fast and very efficient for that. Based on this technique, I can sequentially scan multi-gigabyte files from start to end fairly quickly, usually a minute or less on my slowest machine. If your files are smaller than the system granularity, or even just a few megabytes, you will hardly notice any time elapsed at all (unless your scans themselves are slow).
Update: here is a simplified variation of what I use:
class FileView
{
private:
    DWORD m_AllocGran;
    DWORD m_PageSize;
    HANDLE m_File;
    unsigned __int64 m_FileSize;
    HANDLE m_Map;
    unsigned __int64 m_MapSize;
    LPBYTE m_View;
    unsigned __int64 m_ViewOffset;
    DWORD m_ViewSize;
    void CloseMap()
    {
        CloseView();
        if (m_Map != NULL)
        {
            CloseHandle(m_Map);
            m_Map = NULL;
        }
        m_MapSize = 0;
    }
    void CloseView()
    {
        if (m_View != NULL)
        {
            UnmapViewOfFile(m_View);
            m_View = NULL;
        }
        m_ViewOffset = 0;
        m_ViewSize = 0;
    }
    bool EnsureMap(unsigned __int64 Size)
    {
        // do not exceed EOF or else the file on disk will grow!
        Size = min(Size, m_FileSize);
        if ((m_Map == NULL) ||
            (m_MapSize != Size))
        {
            // a new map is needed...
            CloseMap();
            ULARGE_INTEGER ul;
            ul.QuadPart = Size;
            m_Map = CreateFileMapping(m_File, NULL, PAGE_READONLY, ul.HighPart, ul.LowPart, NULL);
            if (m_Map == NULL)
                return false;
            m_MapSize = Size;
        }
        return true;
    }
    bool EnsureView(unsigned __int64 Offset, DWORD Size)
    {
        if ((m_View == NULL) ||
            (Offset < m_ViewOffset) ||
            ((Offset + Size) > (m_ViewOffset + m_ViewSize)))
        {
            // the requested range is not already in view...
            // round down the offset to the nearest allocation boundary
            unsigned __int64 ulNewOffset = ((Offset / m_AllocGran) * m_AllocGran);
            // round up the size to the next page boundary
            DWORD dwNewSize = (DWORD) ((((Offset - ulNewOffset) + Size) + (m_PageSize-1)) & ~(m_PageSize-1));
            // if the new view will exceed EOF, truncate it
            unsigned __int64 ulOffsetInFile = (ulNewOffset + dwNewSize);
            if (ulOffsetInFile > m_FileSize)
                dwNewSize -= (DWORD) (ulOffsetInFile - m_FileSize);
            if ((m_View == NULL) ||
                (m_ViewOffset != ulNewOffset) ||
                (m_ViewSize != dwNewSize))
            {
                // a new view is needed...
                CloseView();
                // make sure the memory map is large enough to contain the entire view
                if (!EnsureMap(ulNewOffset + dwNewSize))
                    return false;
                ULARGE_INTEGER ul;
                ul.QuadPart = ulNewOffset;
                m_View = (LPBYTE) MapViewOfFile(m_Map, FILE_MAP_READ, ul.HighPart, ul.LowPart, dwNewSize);
                if (m_View == NULL)
                    return false;
                m_ViewOffset = ulNewOffset;
                m_ViewSize = dwNewSize;
            }
        }
        return true;
    }
public:
    FileView() :
        m_AllocGran(0),
        m_PageSize(0),
        m_File(INVALID_HANDLE_VALUE),
        m_FileSize(0),
        m_Map(NULL),
        m_MapSize(0),
        m_View(NULL),
        m_ViewOffset(0),
        m_ViewSize(0)
    {
        // map views need to be positioned on even multiples
        // of the system allocation granularity. let's size
        // them on even multiples of the system page size...
        SYSTEM_INFO si = {0};
        // GetSystemInfo() returns void, so there is no result to test
        GetSystemInfo(&si);
        m_AllocGran = si.dwAllocationGranularity;
        m_PageSize = si.dwPageSize;
    }
    ~FileView()
    {
        CloseFile();
    }
    bool OpenFile(LPTSTR FileName)
    {
        CloseFile();
        if ((m_AllocGran == 0) || (m_PageSize == 0))
            return false;
        HANDLE hFile = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return false;
        ULARGE_INTEGER ul;
        ul.LowPart = GetFileSize(hFile, &ul.HighPart);
        if ((ul.LowPart == INVALID_FILE_SIZE) && (GetLastError() != 0))
        {
            CloseHandle(hFile);
            return false;
        }
        m_File = hFile;
        m_FileSize = ul.QuadPart;
        return true;
    }
    void CloseFile()
    {
        CloseMap();
        if (m_File != INVALID_HANDLE_VALUE)
        {
            CloseHandle(m_File);
            m_File = INVALID_HANDLE_VALUE;
        }
        m_FileSize = 0;
    }
    // size of the currently open file (used by the usage example)
    unsigned __int64 FileSize() const
    {
        return m_FileSize;
    }
    bool AccessBytes(unsigned __int64 Offset, DWORD Size, LPBYTE *Bytes, DWORD *Available)
    {
        if (Bytes) *Bytes = NULL;
        if (Available) *Available = 0;
        if ((m_FileSize != 0) && (Offset < m_FileSize))
        {
            // make sure the requested range is in view
            if (!EnsureView(Offset, Size))
                return false;
            // near EOF, the available bytes may be less than requested
            DWORD dwOffsetInView = (DWORD) (Offset - m_ViewOffset);
            if (Bytes) *Bytes = &m_View[dwOffsetInView];
            if (Available) *Available = min(m_ViewSize - dwOffsetInView, Size);
        }
        return true;
    }
};
Example usage:
FileView fv;
if (fv.OpenFile(TEXT("C:\\path\\file.ext")))
{
    LPBYTE data;
    DWORD len;
    unsigned __int64 offset = 0, filesize = fv.FileSize();
    while (offset < filesize)
    {
        if (!fv.AccessBytes(offset, some size here, &data, &len))
            break; // error
        if (len == 0)
            break; // unexpected EOF
        // use data up to len bytes as needed...
        offset += len;
    }
    fv.CloseFile();
}
This code is designed to allow random jumping anywhere in the file at any data size. Since you are reading bytes sequentially, some of the logic can be simplified as needed.

Your new algorithm only pays 0.1% of the I/O inefficiencies... not worth worrying about.
To get further throughput improvement, you should take a closer look at the "do something" step. See whether you can reuse part of the result from an overlapping window. Check cache behavior. Check if there's a better algorithm for the same computation.
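As one illustration of reusing work between overlapping windows, here is a sketch that maintains a running byte sum: each shift adds the byte that enters the window and subtracts the byte that leaves, instead of re-summing the whole window. The byte sum merely stands in for your unspecified "do something" step, and `rollingSums` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the sum of every window of `window` consecutive bytes in `data`.
// Each result after the first is derived from the previous one in O(1).
std::vector<std::uint64_t> rollingSums(const std::vector<unsigned char>& data,
                                       std::size_t window) {
    std::vector<std::uint64_t> sums;
    if (window == 0 || data.size() < window) return sums;

    // first window: computed in full
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < window; ++i) sum += data[i];
    sums.push_back(sum);

    // later windows: add the entering byte, subtract the leaving byte
    for (std::size_t i = window; i < data.size(); ++i) {
        sum += data[i];
        sum -= data[i - window];
        sums.push_back(sum);
    }
    return sums;
}
```

Whether your computation can be updated incrementally like this depends entirely on what it is; rolling checksums and hashes are the classic cases where it can.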

You have the basic I/O technique down. The easiest improvement you can make now is to pick a good buffer size. With some experimentation, you'll find that read performance increases quickly with buffer size until you hit about 16k, then performance begins to level out.
Your next task is probably to profile your code, and see where it is spending its time. When dealing with performance, it is always best to measure rather than guess. You don't mention what OS you're using, so I won't make any profiler recommendations.
You can also try to reduce the amount of copying/moving of data between your buffer and your workspace. Less copying is generally better. If you can process your data in-place instead of moving it to a new location, that's a win. (I see from your edits you're already doing this.)
Finally, if you're processing many gigabytes of archived information then you should consider keeping your data compressed. It will come as a surprise to many people that it is faster to read compressed data and then decompress it than it is to just read decompressed data. My favorite algorithm for this purpose is LZO which doesn't compress as well as some other algorithms, but decompresses impressively fast. This kind of setup is only worth the engineering effort if:
- Your job is I/O bound.
- You are reading many gigabytes of data.
- You're running the program frequently, so it saves you a lot of time to make it run faster.
