Memory (and other resources) used by individual VirtualAlloc allocation - c++

How much memory (or other resources) is used for an individual VirtualAlloc(xxxx, yyy, MEM_RESERVE, zzz) call?
Is there any difference in resource consumption (e.g. kernel paged/nonpaged pool) when I allocate one large block, like this:
VirtualAlloc( xxxx, 1024*1024, MEM_RESERVE, PAGE_READWRITE )
or multiple smaller blocks, like this:
VirtualAlloc( xxxx, 64*1024, MEM_RESERVE, PAGE_READWRITE );
VirtualAlloc( xxxx+1*64*1024, 64*1024, MEM_RESERVE, PAGE_READWRITE );
VirtualAlloc( xxxx+2*64*1024, 64*1024, MEM_RESERVE, PAGE_READWRITE );
...
VirtualAlloc( xxxx+15*64*1024, 64*1024, MEM_RESERVE, PAGE_READWRITE );
If someone does not know the answer but can suggest an experiment which would be able to check it, it will be helpful as well.
The motivation is that I want to implement returning memory back to the OS for TCMalloc under Windows. My idea is to replace individual large VirtualAlloc calls with a sequence of small (allocation-granularity) calls, so that I can call VirtualFree on each of them. I am aware that the allocation of large blocks will be slower this way, but should any resource-consumption penalties be expected?

Just FYI, you can use GetProcessMemoryInfo and GlobalMemoryStatusEx to get some memory usage measurements.
void DisplayMemoryUsageInformation()
{
    // Requires <windows.h>, <psapi.h> and <iostream>; link against psapi.lib.
    HANDLE hProcess = GetCurrentProcess();
    PROCESS_MEMORY_COUNTERS pmc;
    ZeroMemory(&pmc, sizeof(pmc));
    GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc));
    std::cout << "PageFaultCount: " << pmc.PageFaultCount << std::endl;
    std::cout << "PeakWorkingSetSize: " << pmc.PeakWorkingSetSize << std::endl;
    std::cout << "WorkingSetSize: " << pmc.WorkingSetSize << std::endl;
    std::cout << "QuotaPeakPagedPoolUsage: " << pmc.QuotaPeakPagedPoolUsage << std::endl;
    std::cout << "QuotaPagedPoolUsage: " << pmc.QuotaPagedPoolUsage << std::endl;
    std::cout << "QuotaPeakNonPagedPoolUsage: " << pmc.QuotaPeakNonPagedPoolUsage << std::endl;
    std::cout << "QuotaNonPagedPoolUsage: " << pmc.QuotaNonPagedPoolUsage << std::endl;
    std::cout << "PagefileUsage: " << pmc.PagefileUsage << std::endl;
    std::cout << "PeakPagefileUsage: " << pmc.PeakPagefileUsage << std::endl;

    MEMORYSTATUSEX msx;
    ZeroMemory(&msx, sizeof(msx));
    msx.dwLength = sizeof(msx);
    GlobalMemoryStatusEx(&msx);
    std::cout << "MemoryLoad: " << msx.dwMemoryLoad << std::endl;
    std::cout << "TotalPhys: " << msx.ullTotalPhys << std::endl;
    std::cout << "AvailPhys: " << msx.ullAvailPhys << std::endl;
    std::cout << "TotalPageFile: " << msx.ullTotalPageFile << std::endl;
    std::cout << "AvailPageFile: " << msx.ullAvailPageFile << std::endl;
    std::cout << "TotalVirtual: " << msx.ullTotalVirtual << std::endl;
    std::cout << "AvailVirtual: " << msx.ullAvailVirtual << std::endl;
    std::cout << "AvailExtendedVirtual: " << msx.ullAvailExtendedVirtual << std::endl;
}

Zero, or practically zero, memory is used by making a VirtualAlloc call with the reserve param. This just reserves the address space within the process. The memory will not be used until you actually back the address range with pages by using VirtualAlloc with the commit param.
This is essentially the difference between virtual bytes, the amount of address space taken, and private bytes, the amount of committed memory.
Both of your uses of VirtualAlloc() will reserve the same amount of memory so they are equivalent from the resource consumption side.
I suggest that you do some reading on this before deciding to write your own allocator. One of the best sources for this is Mark Russinovich; you should check his blog. He has written a few entries called "Pushing the Limits" which cover some of this. If you want the real nitty-gritty details, then you should read his book, Microsoft Windows Internals. It is by far the best reference I have read on how Windows manages memory (and everything else).
(Edit) Additional Information:
The relevant pieces are the "Page Directory" and the "Page Table". According to my older copy of Microsoft Windows Internals: on x86, there is a single Page Directory for each process, with 1024 entries, and up to 512 page tables. Each 32-bit pointer used in the process is broken into three pieces: bits [31-22] are the Page Directory Index, [21-12] the Page Table Index, and [11-0] the byte index within the page.
When you use VirtualAlloc with the reserve param, the Page Directory Entry (32 bits) and the Page Table Entry (32 bits) are created. At this time, no page is created for the reserved memory.
The best way to see this information is to use the kernel debugger. I would suggest using LiveKD (Sysinternals). You can use LiveKD without attaching a remote computer, but it doesn't allow live debugging. Load LiveKD and select your process. Then you can run the !PTE command to examine the page table for the process.
Again, I would suggest reading Windows Internals. In my version (4th ed.) there is a chapter (over 100 pages) that covers all of this, with examples for walking through the various data structures in LiveKD.

In my understanding of the page table, pages are tracked in chunks of e.g. 1024 pages, with one word per page. In any case, it is the number of pages, not the number of allocations, that costs. However, there might be other mechanisms that cost "extra" per allocation (I just don't know).
Still: using VirtualFree you can selectively decommit individual pages or page ranges. For a decommitted page, the virtual address range (within your process) is still reserved, but no physical memory (RAM or swap file) is assigned to it. You can later use VirtualAlloc to commit these pages again.
So unless you need to free up address space for other allocators within your process, you can use this mechanism to selectively request and return memory to the OS.
[edit]
Measuring
For measuring, I thought of comparing the performance of both algorithms under one or more typical loads (artificial/random allocation patterns, an allocation-heavy "real world" application, etc.). Advantage: you get the "whole story" - kernel resources, page fragmentation, application performance, etc. Disadvantage: you have to implement both algorithms, you don't learn the reason for any difference, and you probably need very special cases for a measurable difference that sticks out from the noise.
Address space fragmentation warning - be careful with your return algorithm. When returning individual pages to the OS in a "whoever is free" fashion, you might end up with a fragmented address space that has 80% free memory but not 100 KB of it contiguous.

You can try using "perfmon" and adding counters (e.g. Memory) to get a feel for which resources are used by VirtualAlloc. You will have to take a snapshot before and after the call to VirtualAlloc.
Another option is to debug the process making the call to VirtualAlloc under WinDbg and use the memory-related commands (http://windbg.info/doc/1-common-cmds.html#20_memory_heap) to get an idea of what is actually happening.

Related

Tricky issue with accessing the data of a (complex) variable, defined as pointer (C++)

This is what I assume to know: when printing a variable "i" to the console, the command "std::cout << i" is usually used. If "i" is a pointer, I do "std::cout << *i". So far so good.
But in the (classical CUDA FFT) C++ example below, this doesn't work and I don't know why. Here's the code snippet (I don't want to post the entire code for clarity):
cufftExecR2C(fftPlanFwd, (cufftReal *)d_PaddedData, (cufftComplex *)d_DataSpectrum);
for (int i = 0; i < 8; i++)
{
    std::cout << (cufftComplex*)d_DataSpectrum << '\n';
}
"cufftExecR2C" is the 'real to complex' command that performs a Fourier transform of the numbers 0 to 8 in the variable "d_PaddedData" into the variable "d_DataSpectrum", which holds complex numbers. I think these are pointers.
I want to check whether the rest of the code works as intended and print out the variable. Problem: I get "000000070CE00C00" printed in the console 9 times, which seems to me to be the address and not the data of the variable.
Instead of (cufftComplex*)d_DataSpectrum, I tried the following combinations: "*d_DataSpectrum", "d_DataSpectrum", "*d_DataSpectrum[0]", and "d_DataSpectrum[0]". The latter two were in case the complex values are stored in a two-column array. Only "d_DataSpectrum" compiles successfully, but it also gives me 000000070CE00C00.
I would like to know if I miss any tricks to get the data of that pointer?
Edit:
Declarations
fComplex *d_DataSpectrum;
cudaMalloc((void **)&d_DataSpectrum, fftH * (fftW / 2 + 1) * sizeof(fComplex))
Both lines compile, but they cause the .exe to crash at exactly that point:
std::cout << "The Original data is " << d_PaddedData[i] << '\n';
std::cout << "The FFT'd data is" << ((cufftComplex*)d_DataSpectrum)[i].x << '\n'; //crash even with added 'd_DataSpectrum)[i].y' as recommended
Edit2:
After editing it with a "&" to:
std::cout << "The Original data is " << &d_PaddedData[i] << '\n';
std::cout << "The FFT'd data is" << &((cufftComplex*)d_DataSpectrum)[i].x << '\n';
It runs:
The FFT'd data is000000070CE00C10
The Original data is 000000070CE0040C
The FFT'd data is000000070CE00C18
The Original data is 000000070CE00410
The FFT'd data is000000070CE00C20
The Original data is 000000070CE00414
The FFT'd data is000000070CE00C28
Why does the "&" work now?
which seems to me the address and not the data of the variable?
Yes, that's an address, most likely the (starting) address of your real or complex array d_DataSpectrum (please provide the declaration!). You are not dereferencing here; you are simply casting to a cufftComplex pointer. Keep in mind that you want to print out a complex value, not a single default-printable one like float or double!
cufftComplex is (commonly) defined this way:
typedef cuComplex cufftComplex;
So, under the premise that your initial array d_DataSpectrum is 'compatible' with cuComplex, you should be able to print your complex value the way john mentioned in the comments (separate access to the real and imaginary portions):
std::cout << ((cufftComplex*)d_DataSpectrum)[i].x << ' ' << ((cufftComplex*)d_DataSpectrum)[i].y << '\n';
The reason why you observe a crash:
fComplex *d_DataSpectrum;
cudaMalloc((void **)&d_DataSpectrum, fftH * (fftW / 2 + 1) * sizeof(fComplex))
This is allocating a device-side memory buffer. For a proper print-out, you have to transfer it back to the host first via cudaMemcpy with cudaMemcpyDeviceToHost, into a validly allocated host-side buffer. For a very similar scheme (asked as a question as well), see
https://forums.developer.nvidia.com/t/cufft-cufftplan1d-and-cufftexecr2c-issues/43811
A general minimal example for host vs device array usage and printing values:
https://gist.github.com/dpiponi/1502434
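A hedged sketch of that device-to-host round trip (requires the CUDA toolkit and a CUDA device to run; fftH and fftW stand in for the dimensions used in the question, and the function name is mine):

```cpp
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdio>
#include <vector>

// Sketch only: device memory cannot be dereferenced on the host,
// so copy the spectrum back before printing it.
void printSpectrum(cufftComplex* d_DataSpectrum, int fftH, int fftW) {
    const size_t n = (size_t)fftH * (fftW / 2 + 1);
    std::vector<cufftComplex> h_spectrum(n);
    cudaMemcpy(h_spectrum.data(), d_DataSpectrum,
               n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < 8 && i < n; ++i)
        std::printf("%f %f\n", h_spectrum[i].x, h_spectrum[i].y);
}
```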

Qt QBuffer bytes written cannot be read

A bit of confusion here: I'm trying to do this:
QBuffer _ourLogMessagesBuffer;
QByteArray theLogMessage;
...
qDebug() << "Writing " << theLogMessage.size() << " bytes to buffer\n";
qint64 numberOfBytes = _ourLogMessagesBuffer.write(theLogMessage);
qDebug() << "Wrote " << numberOfBytes << " bytes to buffer\n";
qDebug() << "Buffer has " << _ourLogMessagesBuffer.bytesAvailable()
         << " bytes available to read (after write)\n";
This outputs the following:
Writing 196 bytes to buffer
Wrote 196 bytes to buffer
Buffer has 0 bytes available to read (after write)
That last line really confuses me. I thought the return value from the .write() method was supposed to say how many bytes were written? Why would they not be available?
And, later, I attempt the following:
qDebug() << "Buffer has " << _ourLogMessagesBuffer.bytesAvailable()
         << " bytes available to read (before read)\n";
char logMessageBytes[565];
qint64 numberOfBytes = _ourLogMessagesBuffer.read(logMessageBytes, 565);
qDebug() << "Read " << numberOfBytes << " bytes from buffer\n";
Considering the previous bytesAvailable result, the output of these calls isn't too surprising:
Buffer has 0 bytes available to read (before read)
Read 0 bytes from buffer
So I feel like I'm missing a step, and that you have to do something between writing and the data being available to read. Perhaps some sort of seek or something? But I seem to be missing where it says that in the documentation.
Any tips would be appreciated. Thank you!
You need to seek back to the position you want to read from:
_ourLogMessagesBuffer.seek(0);
Then you will be able to see an appropriate number of bytesAvailable. If you think about it as a (physical) head over a position on a tape, it makes sense. As you write, the head moves toward the end, where it can write more data. Any tape ahead of the head is "blank"; there's nothing to read (for a "blank" tape, i.e. a new or empty buffer).
When just writing, the position is automatically updated for you. But if you want to read data you already wrote, you need to tell it to go back.
An exception to this is with, say, a file format. If we are modifying an existing file, we could update a fixed-length timestamp in one part, then immediately read a couple bytes denoting the length of an "author" string, and then read that string in. For that, we would not need a seek as all the data is contiguous, and the write and read functions handle moving the position within the file (buffer) automatically.
If you have non-contiguous reads/writes, you need to seek. Otherwise, it can't read your mind on where you want to read from.

Most efficient way to output a newline

I was wondering what is the most efficient way to output a newline to the console. Please explain why one technique is more efficient. Efficient here means in terms of performance.
For example:
cout << endl;
cout << "\n";
puts("");
printf("\n");
The motivation for this question is that I find myself writing loops with output, and I need to output a newline after all iterations of the loop. I'm trying to find out the most efficient way to do this, assuming nothing else matters. That assumption is probably wrong.
putchar('\n') is the simplest and probably the fastest. cout and printf with the string "\n" work with a null-terminated string, which is slower because two bytes (0A 00) are processed. By the way, carriage return is \r = 13 (0x0D); \n is Line Feed (LF), code 10 (0x0A).
You don't specify whether you are demanding that the update to the screen is immediate or deferred until the next flush. Therefore:
if you're using iostream io:
cout.put('\n');
if you're using stdio io:
std::putchar('\n');
The answer to this question is really "it depends".
In isolation - if all you're measuring is the performance of writing a '\n' character to the standard output device, not tweaking the device, not changing what buffering occurs - then it will be hard to beat options like
putchar('\n');
fputc('\n', stdout);
std::cout.put('\n');
The problem is that this doesn't achieve much. All it does (assuming the output is a screen or visible application window) is move the cursor down the screen and move previous output up. Not exactly an entertaining or otherwise valuable experience for a user of your program. So you won't do this in isolation.
But what comes into play to affect performance (however you measure it) if we don't output newlines in isolation? Let's see:
Output to stdout (or std::cout) is buffered by default. For the output to be visible, the options include turning off buffering or having the code periodically flush the buffer. It is also possible to use stderr (or std::cerr), since that is not buffered by default, assuming stderr is also directed to the console and output to it has the same performance characteristics as stdout.
stdout and std::cout are formally synchronised by default (e.g. look up std::ios_base::sync_with_stdio) to allow mixing of output to stdout and std::cout (same goes for stderr and std::cerr)
If your code outputs more than a set of newline characters, there is the processing (accessing or reading data that the output is based on, by whatever means) to produce those other outputs, the handling of those by output functions, etc.
There are different measures of performance, and therefore different means of improving efficiency based on each one: CPU cycles, total time for output to appear on the console, memory usage, etc.
The console might be a physical screen, or it might be a window created by the application (e.g. hosted in X, Windows). Performance will be affected by the choice of hardware, the implementation of the windowing/GUI subsystems, the operating system, etc.
The above is just a selection, but there are numerous factors that determine what might be considered more or less performance.
On Ubuntu 15.10, g++ v5.2.1 (and an older vxWorks, and OSE)
It is easy to demonstrate that
std::cout << std::endl;
puts a new line char into the output buffer, and then flushes the buffer to the device.
But
std::cout << "\n";
puts a new line char into the output buffer, and does not output to the device. Some future action will be needed to trigger the output of the newline char in the buffer to the device.
Two such actions are:
std::cout << std::flush; // will output the buffer'd new line char
std::cout << std::endl; // will output 2 new line chars
There are also several other actions that can trigger the flush of the std::cout buffering.
#include <iostream>
#include <unistd.h> // for usleep on Linux

void msDelay(int ms) { usleep(ms * 1000); }

int main(int, char**)
{
    std::cout << "with endl and no delay " << std::endl;
    std::cout << "with newline and 3 sec delay " << std::flush << "\n";
    msDelay(3000);
    std::cout << std::endl << " 2 newlines";
    return 0;
}
And, per comment by someone who knows (sorry, I don't know how to copy his name here), there are exceptions for some environments.
It's actually OS/Compiler implementation dependent.
The most efficient way, with the fewest guaranteed side effects, to output a '\n' newline character is to use std::ostream::write() (on some systems this requires that the std::ostream was opened in std::ios_base::binary mode):
static const char newline = '\n';
std::cout.write(&newline, sizeof(newline));
I would suggest to use:
std::cout << '\n'; /* Use std::ios_base::sync_with_stdio(false) if applicable */
or
fputc('\n', stdout);
And turn the optimization on and let the compiler decide what is best way to do this trivial job.
Well, if you want to change the line, I'd like to add the simplest and most common way, which is using endl; it has the added perk of flushing the stream, unlike cout << '\n'; on its own.
Example:
cout << "So i want a new line" << endl;
cout << "Here is your new line";
Output:
So i want a new line
Here is your new line
This can be done for as many new lines as you want. Let me show an example using two new lines; it should clear up any remaining doubts.
Example:
cout << "This is the first line" << endl;
cout << "This is the second line" << endl;
cout << "This is the third line";
Output:
This is the first line
This is the second line
This is the third line
The last line just ends with a semicolon since no newline is needed. endl is also chainable if needed; for example, cout << endl << endl; is a valid sequence.

C++ Memory leaks on Windows 7

I'm writing a program (C++, MinGW 32-bit) to batch process images using OpenCV functions, with AngelScript as a scripting language. As of right now, my software has some memory leaks that add up pretty quickly (the images are 100-200 MB each, and I'm processing thousands at once), but I'm running into an issue where Windows doesn't seem to release the memory used by my program until rebooting.
If I run it on a large set of images, it runs for a while and eventually OpenCV throws an exception saying that it's out of memory. At that point, I close the program, and Task Manager's physical memory meter drops back down to where it was before I started. But here's the catch - every time I try to run the program again, it will fail right off the bat to allocate memory to OpenCV, until I reboot the computer, at which point it will work just great for a few hundred images again.
Is there some way Windows could be holding on to that memory? Or is there another reason why Windows would fail to allocate memory to my program until a reboot occurs? This doesn't make sense to me.
EDIT: The computer I'm running this program on is Windows 7 64 bit with 32 GB of ram, so even with my program's memory issues, it's only using a small amount of the available memory. Normally the program maxes out at a little over 1 GB of ram before it quits.
EDIT 2: I'm also using FreeImage to load the images, I forgot to mention that. Here's the basis of my processing code:
//load bitmap with FreeImage
FIBITMAP *bitmap = NULL;
FREE_IMAGE_FORMAT fif = FIF_UNKNOWN;
fif = FreeImage_GetFileType(filename.c_str(), 0);
bitmap = FreeImage_Load(fif, filename.c_str(), 0);
if (!bitmap) {
    LogString("ScriptEngine: input file is not readable.");
    processingFile = false;
    return false;
}
//convert FreeImage bitmap to my custom wrapper for OpenCV::Mat
ScriptImage img;
img.image = fi2cv(bitmap);
FreeImage_Unload(bitmap);
try {
    //this executes the AngelScript code
    r = ctx->Execute();
} catch (const std::exception& e) {
    std::cout << "Exception in " << __FILE__ << ", line " << __LINE__ << ", " << __FUNCTION__ << ": " << e.what() << std::endl;
}
try {
    engine->GarbageCollect(asGC_FULL_CYCLE | asGC_DESTROY_GARBAGE);
} catch (const std::exception& e) {
    std::cout << "Exception in " << __FILE__ << ", line " << __LINE__ << ", " << __FUNCTION__ << ": " << e.what() << std::endl;
}
As you can see, the only pointer is to the FIBITMAP, which is freed.
It is very likely that you are making a copy of the image data on this line:
img.image = fi2cv(bitmap);
Since you are immediately freeing the bitmap afterwards, that data must persist after the free.
Check whether there is a resource release for ScriptImage objects.

Listing All Physical Drives (Windows)

How can I get all the physical drive paths (\\.\PhysicalDriveX) on a Windows computer, with C/C++?
The answers to this question suggest getting the logical drive letters and then getting the physical drive corresponding to each mounted drive. The problem is, I want to get all physical drives connected to the computer, including drives that are not mounted.
Other answers suggest incrementing a value from 0-15 and checking if a drive exists there (\\.\PhysicalDrive0, \\.\PhysicalDrive1, ...) or calling WMIC to list all the drives.
While these approaches might work, they don't seem like the best ones to take. Isn't there a simple function such as GetPhysicalDrives that just returns a vector of std::strings containing the paths of all the physical drives?
You can use QueryDosDevice. Based on the description, you'd expect this to list things like C: and D:, but it will also list things like PhysicalDrive0, PhysicalDrive1, and so on.
The major shortcoming is that it will also list a lot of other device names you probably don't care about, so (for example) on my machine, I get a list of almost 600 device names, of which only a fairly small percentage is related to what you care about.
Just in case you care, some (old) sample code:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <iostream>
#include <cstring> // for strlen

int main(int argc, char **argv) {
    char physical[65536];
    char logical[65536];

    if (argc > 1) {
        for (int i = 1; i < argc; i++) {
            QueryDosDevice(argv[i], logical, sizeof(logical));
            std::cout << argv[i] << " : \t" << logical << std::endl << std::endl;
        }
        return 0;
    }

    QueryDosDevice(NULL, physical, sizeof(physical));
    std::cout << "devices: " << std::endl;
    for (char *pos = physical; *pos; pos += strlen(pos) + 1) {
        QueryDosDevice(pos, logical, sizeof(logical));
        std::cout << pos << " : \t" << logical << std::endl << std::endl;
    }
    return 0;
}
However, if I run this like `devlist | grep "^Physical"`, it lists the physical drives.
Yes, you can just type NET USE. Here is an example output...
NET USE
New connections will be remembered.

Status   Local   Remote                                              Network
-------------------------------------------------------------------------------
         H:      \\romaxtechnology.com\HomeDrive\Users\Henry.Tanner  Microsoft Windows Network
OK       N:      \\ukfs01.romaxtechnology.com\romaxfs                Microsoft Windows Network
OK       X:      \\ukfs03.romaxtechnology.com\exchange               Microsoft Windows Network
OK       Z:      \\ukfs07\Engineering                                Microsoft Windows Network
         \\romaxtechnology.com\HomeDrive                             Microsoft Windows Network
OK       \\ukfs07\IPC$                                               Microsoft Windows Network
The command completed successfully.