Memory image decrease - C++

I am running this code:
#include <iostream>
#include <cstdio>
#include <cstddef>

int main(int argc, char *argv[]) {
    int a1 = 0, a2 = 0;
    int a3, a4;
    int b1 = ++a1;
    int b2 = a2++;
    int* p1 = &a1;
    int* p2 = &++a1;
    size_t st;
    ptrdiff_t pt;
    int i = 0;
    while (true) {
        printf("i: %d", i++);
    }
    printf("\n\ni now is: %d\n", i);
    return 0;
}
Why do I observe such a decrease in image memory (the violet series)?
legend:
I made this as a general Win32 project, not CLR.
I changed the code so I can see when the int finally becomes negative. The while() is now:
int i = 0;
while (0 < ++i) {
    printf("i: %d", i++);
}
printf("\n\ni now is: %d\n", i);
It is strange: please see what happened after just 30,000 iterations. Why do we see these fluctuations in the image memory? I can now see that this is probably related to VMMap itself, because it happens only if I choose "launch & trace a new process", but not when I "view a running process" and point it at the running exe started from VS2010. Here is the screenshot of the process "launched & traced":
I also observed heavy paging, which started roughly with this decline in image memory (the paging accelerated quickly and soon hit the RAM limit that I have set to 2 GB):
and here is the running process only "viewed" (run from VS2010):
so maybe some issue related to memory management of .NET applications is at play here?
I am still waiting for my int to cross the two's-complement boundary.
Well... I have to edit again: it turns out that, contrary to what I thought above, the decreasing image-memory effect is present when the process is only viewed (not launched) too. Below is a picture of the same process 10 minutes later (still waiting for the int to turn negative):
and here it is:
So the largest positive two's-complement int on my machine is 2 147 483 647
and the smallest negative is -2 147 483 648, which is easy to verify this way:
#include <limits>
const int min_int = std::numeric_limits<int>::min();
const int max_int = std::numeric_limits<int>::max();
It gave me the same result: -2 147 483 648 and 2 147 483 647.
Back to the beginning:
When I comment out everything but the while() loop, the same thing happens: the image starts decreasing after the process has been running for about 10 minutes, so it is not the useless code that causes this. But what does?

The working set is largely under the control of the operating system. What your code does is only one factor it considers when deciding whether to grow or trim your working set. Other factors include whether your application is in the foreground, how active it is, how greedy the heap algorithm is, how much memory pressure exists because of the demands of other processes, etc. This is by design.
The drops are probably Windows choosing to trim your working set. Since most of the code that was originally loaded was probably used only for initialization and is not involved in the loop, it was easy for the OS to reclaim those image pages based on an LRU algorithm.
Notice that the working set attributed to the image is not the only portion that was trimmed.
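If you want to watch this from inside the process rather than in VMMap, a minimal sketch along these lines shows the effect of a trim (my addition, not from the original question; GetProcessMemoryInfo and SetProcessWorkingSetSize are documented Win32 APIs, link with psapi.lib):
#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main()
{
    PROCESS_MEMORY_COUNTERS pmc;
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        printf("working set before trim: %lu KB\n", (unsigned long)(pmc.WorkingSetSize / 1024));

    // Passing (SIZE_T)-1 for both limits asks the OS to trim the working set
    // as far as it can, much as it does on its own under memory pressure.
    SetProcessWorkingSetSize(GetCurrentProcess(), (SIZE_T)-1, (SIZE_T)-1);

    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        printf("working set after trim:  %lu KB\n", (unsigned long)(pmc.WorkingSetSize / 1024));
    return 0;
}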

Related

cxxrt::bad_alloc despite large EPC

I am running the following piece of code inside an SGX enclave:
#include <iostream>
#include <exception>

void test_enclave_size() {
    unsigned int i = 0;
    const unsigned int MB = 1024 * 1024;
    try {
        for (; i < 10000; i++) {
            char* tmp = new char[MB];
        }
    } catch (const std::exception& e) {
        std::cout << "Crash with " << e.what() << " " << i << std::endl;
    }
}
On my dev machine with the standard 128 MB EPC this throws a cxxrt::bad_alloc after 118 MB, which makes sense because I believe only 96 MB is guaranteed to be available for enclave programs. However, when running this code on a Standard_DC32s_V3, which has 192 GB of EPC memory, I get the exact same result. I assumed that because the EPC is advertised as extremely large, I should be able to allocate far more than 128 MB.
I have thought of a couple of reasons why this might be happening:
While the EPC is now 192 GB in size, each process is still limited to 128 MB.
There is something in the kernel that needs to be enabled for me to take advantage of this large EPC.
I am misunderstanding what Azure is advertising.
I wanted to see if anyone has a good idea of what is happening before contacting Azure support, since this might be a user error.
Edit:
It turns out my second reason was the closest. As X99 pointed out, when developing an enclave application there is a configuration file that defines several factors such as the number of thread contexts, whether debugging is enabled, and the maximum heap/stack size. The maximum heap size in my configuration was set to about 118 MB, which explains why I started to get bad allocations past that amount. Once I increased the number, the issue went away. Side note: if you are on Linux, the drivers support paging. This means you can use as much memory as you wish, if you can afford the paging overhead.
If you are using Open Enclave as your SDK, this configuration file (example) is what you should be editing. In this example, the maximum heap and stack are 1024 pages each, which is about 4 MB. This page may be useful to you as well!
If the machine you're running your enclave on has more than 128 MB of EPC AND will allow you to go further (because of a BIOS setting), there is one more setting you must fiddle with, in your Enclave.config.xml file:
<EnclaveConfiguration>
  <ProdID>0</ProdID>
  <ISVSVN>0</ISVSVN>
  <StackMaxSize>0x40000</StackMaxSize>
  <HeapMaxSize>0x100000</HeapMaxSize>
  <TCSNum>10</TCSNum>
  <TCSPolicy>1</TCSPolicy>
  <!-- Recommend changing 'DisableDebug' to 1 to make the enclave undebuggable for enclave release -->
  <DisableDebug>0</DisableDebug>
  <MiscSelect>0</MiscSelect>
  <MiscMask>0xFFFFFFFF</MiscMask>
</EnclaveConfiguration>
More specifically, the HeapMaxSize value. See the SGX developer reference, pages 58-59.
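As an illustration only (the exact value depends on how much enclave heap you actually want to commit and on what your EPC and BIOS allow), allowing roughly 1 GB of heap would mean changing that one element to:
<HeapMaxSize>0x40000000</HeapMaxSize>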

How to diagnose a visual studio project slowing down as time goes on?

Computer:
Processor: Intel Xeon Silver 4114 CPU @ 2.19 GHz (2 processors)
RAM: 96 GB 2666 MHz: 12 x 8 GB sticks
OS: Windows 10
GPU: None
Hard drive: Samsung MZVLB512HAJQ-000H2 - 512 GB M.2 PCIe NVMe
IDE:
Visual Studio 2019
I am including what I am doing in case it is relevant. I am running a Visual Studio project in which I read data off a GSC PCI SIO4B Sync Card 256K. Using the API for this card (documentation: http://www.generalstandards.com/downloads/GscApi.1.6.10.1.pdf) I read 150 bytes of data at a rate of 100 Hz using the code below. That data is then split into the message structure of my device. I can't give info on the message structure, but the data is then combined into the various words using a union and added to an integer array int Data[100];
Union Example:
union data_set {
    unsigned int integer;
    unsigned char input[2];
} word;
Example of how the data is read:
PLX_PHYSICAL_MEM cpRxBuffer;
#define TEST_BUFFER_SIZE 0x400

// allocates memory for the buffer
cpRxBuffer.Size = TEST_BUFFER_SIZE;
status = GscAllocPhysicalMemory(BoardNum, &cpRxBuffer);
status = GscMapPhysicalMemory(BoardNum, &cpRxBuffer);
memset((unsigned char*)cpRxBuffer.UserAddr, 0xa5, sizeof(cpRxBuffer));

// start data reception:
status = GscSio4ChannelReceivePlxPhysData(BoardNum, iRxChannel, &cpRxBuffer, SetMaxBytes, &messageID);

// wait for Rx operation to complete
status = GscSio4ChannelWaitForTransfer(BoardNum, iRxChannel, 7000, messageID, &amount);
if (status)
{
    // If we have an error, "bytesTransferred" will contain the number of bytes that we
    // actually transmitted.
    DisplayErrorMessage(status);
    printf("\n\t%04X bytes out of %04X transferred", amount, SetMaxBytes);
}
My issue is that this code works fine and keeps up for around 5 minutes, then it suddenly stops being able to keep up and the FIFO (first in, first out) register on the PCI card begins to fill faster than the code can process the data. To me this seems like a memory leak, since the code works fine for a long time and then starts to slow down even though nothing has changed and all the code is doing is reading the data off the card. We used to save the data in a really large array, but even after removing that we had the same issue.
I am unsure how to figure out exactly what is happening, and I'm hoping for a way to determine whether there is a memory leak and how to fix it if there is.
The memory leak is only a guess though, and it could very well be something else causing the problem, so any out-of-the-box suggestions for diagnosing it are also appreciated.
Similar to Paul's answer, but I like to strategically place two (or more) _CrtMemCheckpoint calls followed by _CrtMemDifference, to cut down the noise.
Memory leaks can be detected and reported on (in Debug builds) by calling the _CrtDumpMemoryLeaks function. When running under the debugger, this will tell you (in the output tab) how many allocations you have at the time that it is called and the file and line number that each was allocated from.
Call this right at the end of your program, after you (think you) have freed all the resources you use. Anything left over is a candidate for being a leak.
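A minimal sketch of how these calls fit together (illustrative only; the allocation here is just a stand-in for whatever your acquisition loop does, and the reporting only happens in Debug builds):
#define _CRTDBG_MAP_ALLOC
#include <cstdlib>
#include <crtdbg.h>

int main()
{
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);       // snapshot before the suspect code runs

    char* leaked = new char[150];     // stand-in for the acquisition loop
    (void)leaked;                     // deliberately not deleted

    _CrtMemCheckpoint(&after);        // snapshot after the suspect code
    if (_CrtMemDifference(&diff, &before, &after))
        _CrtMemDumpStatistics(&diff); // report only what changed between snapshots

    _CrtDumpMemoryLeaks();            // full dump of outstanding allocations
    return 0;
}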

WinAPI calling code at specific timestamp

Using functions available in the WinAPI, is it possible to ensure that a specific function is called according to a millisecond-precise timestamp? And if so, what would be the correct implementation?
I'm trying to write tool assisted speedrun software. This type of software sends user input commands at very exact moments after the script is launched to perform humanly impossible inputs that allow faster completion of videogames. A typical sequence looks something like this:
At 0 milliseconds send right key down event
At 5450 milliseconds send right key up, and up key down event
At 5460 milliseconds send left key down event
etc..
What I've tried so far is listed below. As I'm not experienced in the low-level nuances of high-precision timers, I have some results but no understanding of why they are this way:
Using Sleep in combination with timeBeginPeriod set to 1 between inputs gave the worst results. Out of 20 executions, 0 met the timing requirement. I believe this is well explained in the documentation for Sleep: "Note that a ready thread is not guaranteed to run immediately. Consequently, the thread may not run until some time after the sleep interval elapses." My understanding is that Sleep isn't up to this task.
Using a busy-wait loop checking GetTickCount64 with timeBeginPeriod set to 1 produced slightly better results. Out of 20 executions, 2 met the timing requirement, but apparently that was just a fortunate circumstance. I've looked up some info on this timing function, and my suspicion is that it doesn't update often enough to allow 1 millisecond accuracy.
Replacing GetTickCount64 with QueryPerformanceCounter improved the situation slightly. Out of 20 executions, 8 succeeded. I wrote a logger that stores the QPC timestamps right before each input is sent and dumps the values to a file after the sequence is finished. I even went as far as to preallocate space for all variables in my code to make sure that time isn't wasted on needless explicit memory allocations. The logged values diverge from the timestamps I supply the program by anything from 1 to 40 milliseconds. General-purpose programming can live with that, but in my case a single frame of the game is 16.7 ms, so in the worst case with delays like these I can be 3 frames late, which defeats the purpose of the whole experiment.
Setting the process priority to high didn't make any difference (a rough sketch of that setup is shown right after this list).
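For reference, here is roughly what that first combination (Sleep, timeBeginPeriod(1) and high process priority) looked like. This is my reconstruction for illustration, not the original test code; it links against winmm.lib:
#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod

void SleepUntilOffset(DWORD msFromStart, ULONGLONG startTick)
{
    ULONGLONG target = startTick + msFromStart;
    ULONGLONG now = GetTickCount64();
    if (now < target)
        Sleep((DWORD)(target - now));   // still subject to scheduler latency
}

int main()
{
    timeBeginPeriod(1);                                           // request 1 ms timer resolution
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);   // raise process priority

    ULONGLONG start = GetTickCount64();
    SleepUntilOffset(5450, start);      // e.g. wait until the 5450 ms mark
    // ... SendInput(...) would go here ...

    timeEndPeriod(1);
    return 0;
}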
At this point I'm not sure where to look next. My two guesses are that maybe the time it takes to iterate the busy loop and check the time using (QPCNow - QPCStart) / QPF is itself somehow long enough to introduce the mentioned delay, or that the process is interrupted by the OS scheduler somewhere along the execution of the loop and control returns too late.
The game is 100% deterministic and locked at 60 fps. I am convinced that if I manage to time the input accurately the result will always be 20 out of 20, but at this point I'm beginning to suspect that this may not be possible.
EDIT: As per request, here is a stripped-down testing version. Breakpoint after the second call to ExecuteAtTime and view the TimeBefore...Input variables. For me they read 1029 and 6017 (I've omitted the decimals), meaning that the code executed 29 and 17 milliseconds after it should have.
Disclaimer: the code is not written to demonstrate good programming practices.
#include "stdafx.h"
#include <windows.h>
__int64 g_TimeStart = 0;
double g_Frequency = 0.0;
double g_TimeBeforeFirstInput = 0.0;
double g_TimeBeforeSecondInput = 0.0;
double GetMSSinceStart(double& debugOutput)
{
LARGE_INTEGER now;
QueryPerformanceCounter(&now);
debugOutput = double(now.QuadPart - g_TimeStart) / g_Frequency;
return debugOutput;
}
void ExecuteAtTime(double ms, INPUT* keys, double& debugOutput)
{
while(GetMSSinceStart(debugOutput) < ms)
{
}
SendInput(2, keys, sizeof(INPUT));
}
INPUT* InitKeys()
{
INPUT* result = new INPUT[2];
ZeroMemory(result, 2*sizeof(INPUT));
INPUT winKey;
winKey.type = INPUT_KEYBOARD;
winKey.ki.wScan = 0;
winKey.ki.time = 0;
winKey.ki.dwExtraInfo = 0;
winKey.ki.wVk = VK_LWIN;
winKey.ki.dwFlags = 0;
result[0] = winKey;
winKey.ki.dwFlags = KEYEVENTF_KEYUP;
result[1] = winKey;
return result;
}
int _tmain(int argc, _TCHAR* argv[])
{
INPUT* keys = InitKeys();
LARGE_INTEGER qpf;
QueryPerformanceFrequency(&qpf);
g_Frequency = double(qpf.QuadPart) / 1000.0;
LARGE_INTEGER qpcStart;
QueryPerformanceCounter(&qpcStart);
g_TimeStart = qpcStart.QuadPart;
//Opens windows start panel one second after launch
ExecuteAtTime(1000.0, keys, g_TimeBeforeFirstInput);
//Closes windows start panel 5 seconds later
ExecuteAtTime(6000.0, keys, g_TimeBeforeSecondInput);
delete[] keys;
Sleep(1000);
return 0;
}

Why does Sleep() slow down subsequent code for 40ms?

I originally asked about this at coderanch.com, so if you've tried to assist me there, thanks, and don't feel obliged to repeat the effort. coderanch.com is mostly a Java community, though, and this appears (after some research) to really be a Windows question, so my colleagues there and I thought this might be a more appropriate place to look for help.
I have written a short program that either spins on the Windows performance counter until 33ms have passed, or else calls Sleep(33). The former exhibits no unexpected effects, but the latter appears to (inconsistently) slow subsequent processing for about 40ms (either that, or it has some effect on the values returned from the performance counter for that long). After the spin or Sleep(), the program calls a routine, runInPlace(), that spins for 2ms, counting the number of times it queries the performance counter, and returning that number.
When the initial 33ms delay is done by spinning, the number of iterations of runInPlace() tends to be (on my Windows 10, XPS-8700) about 250,000. It varies, probably due to other system overhead, but it varies smoothly around 250,000.
Now, when the initial delay is done by calling Sleep(), something strange happens. A lot of the calls to runInPlace() return a number near 250,000, but quite a few of them return a number near 50,000. Again, the range varies around 50,000, fairly smoothly. But it is clearly clustering around one value or the other, with nearly no returns anywhere between 80,000 and 150,000. If I call runInPlace() 100 times after each delay, instead of just once, it never returns a number of iterations in the smaller range after the 20th call. As runInPlace() runs for 2ms, this means the behavior I'm observing disappears after 40ms. If I have runInPlace() run for 4ms instead of 2ms, it never returns a number of iterations in the smaller range after the 10th call, so, again, the behavior disappears after 40ms (likewise, if I have runInPlace() run for only 1ms, the behavior disappears after the 40th call).
Here's my code:
#include "stdafx.h"
#include "Windows.h"
int runInPlace(int msDelay)
{
LARGE_INTEGER t0, t1;
int n = 0;
QueryPerformanceCounter(&t0);
do
{
QueryPerformanceCounter(&t1);
n++;
} while (t1.QuadPart - t0.QuadPart < msDelay);
return n;
}
int _tmain(int argc, _TCHAR* argv[])
{
LARGE_INTEGER t0, t1;
LARGE_INTEGER frequency;
int n;
QueryPerformanceFrequency(&frequency);
int msDelay = 2 * frequency.QuadPart / 1000;
int spinDelay = 33 * frequency.QuadPart / 1000;
for (int i = 0; i < 100; i++)
{
if (argc > 1)
Sleep(33);
else
{
QueryPerformanceCounter(&t0);
do
{
QueryPerformanceCounter(&t1);
} while (t1.QuadPart - t0.QuadPart < spinDelay);
}
n = runInPlace(msDelay);
printf("%d \n", n);
}
getchar();
return 0;
}
Here's some output typical of what I get when using Sleep() for the delay:
56116
248936
53659
34311
233488
54921
47904
45765
31454
55633
55870
55607
32363
219810
211400
216358
274039
244635
152282
151779
43057
37442
251658
53813
56237
259858
252275
251099
And here's some output typical of what I get when I spin to create the delay:
276461
280869
276215
280850
188066
280666
281139
280904
277886
279250
244671
240599
279697
280844
159246
271938
263632
260892
238902
255570
265652
274005
273604
150640
279153
281146
280845
248277
Can anyone help me understand this behavior? (Note, I have tried this program, compiled with Visual C++ 2010 Express, on five computers. It only shows this behavior on the two fastest machines I have.)
This sounds like it is due to the reduced clock speed that the CPU will run at when the computer is not busy (SpeedStep). When the computer is idle (like in a sleep) the clock speed will drop to reduce power consumption. On newer CPUs this can be 35% or less of the listed clock speed. Once the computer gets busy again there is a small delay before the CPU will speed up again.
You can turn off this feature either in the BIOS or by changing the "Minimum processor state" setting under "Processor power management" in the advanced settings of your power plan to 100%.
Besides what @1201ProgramAlarm said (which may very well be the cause, modern processors are extremely fond of downclocking whenever they can), it may also be a cache warm-up problem.
When you ask to sleep for a while the scheduler typically schedules another thread/process for the next CPU time quantum, which means that the caches (instruction cache, data cache, TLB, branch predictor data, ...) relative to your process are going to be "cold" again when your code regains the CPU.
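If you want to check the downclocking hypothesis from user mode, a sketch like the following can report the current versus maximum core frequency right after the Sleep (illustrative only; PROCESSOR_POWER_INFORMATION is not declared in the SDK's user-mode headers, so it is defined locally as the documentation describes, and you link against PowrProf.lib):
#include <windows.h>
#include <powrprof.h>
#include <cstdio>
#include <vector>

// Layout as documented for CallNtPowerInformation(ProcessorInformation, ...)
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION;

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::vector<PROCESSOR_POWER_INFORMATION> info(si.dwNumberOfProcessors);

    if (CallNtPowerInformation(ProcessorInformation, NULL, 0,
            info.data(), (ULONG)(info.size() * sizeof(info[0]))) == 0)
    {
        for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)
            printf("core %lu: %lu MHz (max %lu MHz)\n",
                   info[i].Number, info[i].CurrentMhz, info[i].MaxMhz);
    }
    return 0;
}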

Speed up data logging code

I have a device that outputs 64 bits of binary data at a rate of 1KHz. I am reading the device over USB via a 3rd party DLL, converting the binary data into a float, timestamping it, and writing to file.
I have the following setup at the moment:
#include <windows.h>
#include <conio.h>
#include <fstream>
#include <cstdio>
#include <cstdint>

int main(int argc, char* argv[])
{
    unsigned char Message_Rx[64];
    USHORT Bytes_Read = 0;
    std::ofstream out(argv[1]);
    do
    {
        Result = Comms.USBRead(&Message_Rx[0], &Bytes_Read);
        unsigned long now = getTickCount(start);
        if (Result != 0)
        {
            uint16_t msb = (Message_Rx[11] & 0xff) << 8;
            uint16_t lsb = (Message_Rx[12] & 0xff);
            uint16_t rate = msb | lsb;
            char outstring[1024];
            sprintf(outstring, "%lu\t%.7f", now, (float)rate * 0.03125);
            out << outstring << "\n";
        }
    } while (!kbhit());
    out.close();
}
This produces perfectly good results on my desktop. There doesn't appear to be any data missing and the timestamps are continuous and 1ms apart.
143379582 -0.5937500
143379583 -1.5312500
143379584 -1.6250000
143379585 -1.4062500
143379586 -1.1875000
143379587 -1.3437500
143379588 -1.3125000
143379589 -1.3125000
143379590 -1.1562500
But when I run this on the old laptop that I need to use I get timestamps that appear in blocks and it looks like there must be some data missing:
143379582 -0.5937500
143379582 -1.5312500
143379582 -1.6250000
143379582 -1.4062500
143379582 -1.1875000
143379593 -1.3437500
143379593 -1.3125000
143379593 -1.3125000
143379593 -1.1562500
Is there a way to speed up my code so that I won't lose data?
To say this loud and clear: for any PC that is not an Intel 486SX, 64 kbit/s is a laughably low rate. Getting a few Mb/s over USB is very doable with small, dollar-a-piece microcontrollers without any optimization.
Whatever goes wrong needs investigation much more than your code does.
I don't know the Comms library, but that's where I'd look for the place where time is spent.
Other than that, printing stuff to the screen should take orders of magnitude more time than your processing, but still shouldn't be a problem. As mentioned, 1 kS/s at 64 bits per sample is nothing for modern (read: last twenty years) PC hardware.
I recommend storing the raw data until the key is hit. After the key is pressed, output the data.
You want to remove formatting and output from high performance code areas.
Paraphrasing a song: "there will be time enough for printing when the data's done."
Edit 1:
An array-based circular queue is a good data structure to hold the incoming data. This gives you the last N data samples.
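A minimal sketch of such a circular queue (illustrative only; the sample layout and capacity are assumptions, not requirements):
#include <cstddef>

struct Sample {
    unsigned long tick;        // timestamp, e.g. from getTickCount(start)
    unsigned char raw[64];     // raw message bytes
};

template <size_t N>
class RingBuffer {
public:
    RingBuffer() : head_(0), count_(0) {}

    void push(const Sample& s) {
        buf_[head_] = s;                   // overwrites the oldest sample when full
        head_ = (head_ + 1) % N;
        if (count_ < N) ++count_;
    }

    size_t size() const { return count_; }

    // i = 0 is the oldest retained sample
    const Sample& operator[](size_t i) const {
        size_t start = (head_ + N - count_) % N;
        return buf_[(start + i) % N];
    }

private:
    Sample buf_[N];
    size_t head_;
    size_t count_;
};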
Whenever you have issues with performance, your first step should be to profile the code to see what parts of it are taking up time.
However, for your code, I would say that the printing and string handling are unnecessary for the main loop. I would have a separate array of timestamps and within my main loop only acquire data.
After a key is hit, you no longer have timing restrictions and can deal with the somewhat expensive operation of file I/O and building up of the strings.
A final note is that your OS might be stealing CPU cycles from you. You may want to try to run your code with higher priorities to rule out scheduling.
With all that said, as was mentioned above, your data rate should be sustainable unless you're running on some really vintage hardware.
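Putting these suggestions together, a rough sketch of the restructured loop might look like this. readSample() is a placeholder for the Comms.USBRead call and timestamping from the question; the buffer size and priority tweak are assumptions, not measured requirements:
#include <windows.h>
#include <conio.h>
#include <fstream>
#include <vector>
#include <cstdio>

struct Sample { unsigned long tick; unsigned short rate; };

// Placeholder for the card read from the question (Comms.USBRead + getTickCount).
static bool readSample(Sample& s)
{
    unsigned char msg[64] = { 0 };   // would be filled by USBRead
    s.tick = GetTickCount();         // the question uses its own getTickCount(start)
    s.rate = (unsigned short)(((msg[11] & 0xff) << 8) | (msg[12] & 0xff));
    return true;
}

int main(int argc, char* argv[])
{
    std::vector<Sample> samples;
    samples.reserve(4 * 1000 * 1000);        // preallocate: no reallocation in the loop

    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);  // rule out scheduling

    do
    {
        Sample s;
        if (readSample(s))
            samples.push_back(s);            // acquisition loop does no formatting or I/O
    } while (!_kbhit());

    // Formatting and file I/O only after a key is hit, when timing no longer matters.
    std::ofstream out(argc > 1 ? argv[1] : "log.txt");
    for (size_t i = 0; i < samples.size(); ++i)
    {
        char line[64];
        sprintf(line, "%lu\t%.7f", samples[i].tick, (float)samples[i].rate * 0.03125f);
        out << line << "\n";
    }
    return 0;
}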