I am running the following piece of code inside an SGX enclave:
void test_enclave_size() {
unsigned int i = 0;
const unsigned int MB = 1024 * 1024;
try {
for (; i < 10000; i++) {
char* tmp = new char[MB];
}
} catch (const std::exception &e) {
std::cout << "Crash with " << e.what() << " " << i << std::endl;
}
}
On my dev machine with the standard 128 MB EPC this throws a bad cxxrt::bad_alloc after 118 MB, which makes sense because I believe only 96 MB is guaranteed to be available for enclave programs. However, when running this code on a Standard_DC32s_V3, which has 192 GB of EPC memory, I get the exact same result. I assumed that because the EPC is advertised to be extremely large, I should be able to allocate far more than 128 MB.
I have thought of a couple of reasons why this might be happening:
1. While the EPC is now 192 GB in size, each process is still limited to 128 MB.
2. There is something in the kernel that needs to be enabled for me to take advantage of this large EPC.
3. I am misunderstanding what Azure is advertising.
I wanted to see if anyone has a good idea of what is happening before I contact Azure support, since this might be a user error.
Edit:
It turns out my second reason was the closest. As X99 pointed out, when developing an enclave application there is a configuration file that defines several settings, such as the number of thread contexts, whether debugging is enabled, and the maximum heap and stack sizes. The maximum heap size in my configuration was set to about 118 MB, which explains why allocations started to fail past that amount. Once I increased the value, the issue went away. Side note: if you are on Linux, the drivers support paging, so you can use as much memory as you wish if you can afford the paging overhead.
If you are using Open Enclave as your SDK, this configuration file (example) is what you should be editing. In this example, the maximum heap and stack are 1024 pages, which is about 4MB. This page may be useful to you as well!
If the machine you're running your enclave on has more than 128 MB of EPC AND will allow you to go further (because of a BIOS setting), there is one more setting you must adjust, in your Enclave.config.xml file:
<EnclaveConfiguration>
<ProdID>0</ProdID>
<ISVSVN>0</ISVSVN>
<StackMaxSize>0x40000</StackMaxSize>
<HeapMaxSize>0x100000</HeapMaxSize>
<TCSNum>10</TCSNum>
<TCSPolicy>1</TCSPolicy>
<!-- Recommend changing 'DisableDebug' to 1 to make the enclave undebuggable for enclave release -->
<DisableDebug>0</DisableDebug>
<MiscSelect>0</MiscSelect>
<MiscMask>0xFFFFFFFF</MiscMask>
</EnclaveConfiguration>
To be more specific, the setting in question is the HeapMaxSize value. See the SGX developer reference, pages 58-59.
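As a rough aid for choosing a value, the sketch below converts a desired heap size in MiB into the hexadecimal byte count written in HeapMaxSize, rounded up to a whole 4 KiB page. The helper names are hypothetical (they are not part of any SGX SDK), and the page rounding is an assumption to check against the developer reference for your SDK version.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helpers for illustration only.
const std::uint64_t kPageSize = 4096;        // assumed 4 KiB page
const std::uint64_t kMiB = 1024 * 1024;

// Bytes needed for `mib` MiB, rounded up to a whole number of pages.
inline std::uint64_t heap_bytes_for_mib(std::uint64_t mib) {
    return ((mib * kMiB + kPageSize - 1) / kPageSize) * kPageSize;
}

// Formats a byte count the way the configuration file writes it, e.g. "0x100000".
inline std::string to_config_hex(std::uint64_t bytes) {
    std::ostringstream out;
    out << "0x" << std::hex << bytes;
    return out.str();
}
```

With these helpers, the sample value 0x100000 above corresponds to 1 MiB, and a 128 MiB heap would be written as 0x8000000.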
Computer:
Processor: Intel Xeon Silver 4114 CPU @ 2.19 GHz (2 processors)
RAM: 96 GB, 2666 MHz (12 x 8 GB sticks)
OS: Windows 10
GPU: None
Hard drive: Samsung MZVLB512HAJQ-000H2 - 512GB M.2 PCIe NVMe
IDE:
Visual Studio 2019
I am including what I am doing in case it is relevant. I am running a Visual Studio program that reads data from a GSC PCI SIO4B Sync Card 256K. Using the API for this card (Documentation: http://www.generalstandards.com/downloads/GscApi.1.6.10.1.pdf) I read 150 bytes of data at a rate of 100 Hz using the code below. That data is then split according to my device's message structure. I can't give info on the message structure, but the data is then combined into the various words using a union and added to an integer array int Data[100];
Union Example:
union data_set{
unsigned int integer;
unsigned char input[2];
} word;
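As an aside, here is a sketch of combining two received bytes into one 16-bit word with shifts instead of the union. Reading a union member other than the one last written is undefined behavior in standard C++ (although many compilers tolerate it), and with a 32-bit unsigned int the two-byte array above only covers half of integer. The function name and the byte order are assumptions, since the message layout is not given.

```cpp
#include <cstdint>

// Hypothetical helper: builds a 16-bit word from two bytes.
// `low` is the least significant byte; swap the arguments if the
// device sends the most significant byte first.
inline std::uint16_t make_word(std::uint8_t low, std::uint8_t high) {
    return static_cast<std::uint16_t>(low | (high << 8));
}
```

For example, make_word(0x34, 0x12) yields 0x1234 regardless of the host's endianness.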
Example of how the data is read:
PLX_PHYSICAL_MEM cpRxBuffer;
#define TEST_BUFFER_SIZE 0x400
//allocates memory for the buffer
cpRxBuffer.Size = TEST_BUFFER_SIZE;
status = GscAllocPhysicalMemory(BoardNum, &cpRxBuffer);
status = GscMapPhysicalMemory(BoardNum, &cpRxBuffer);
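// fill the whole mapped buffer; sizeof(cpRxBuffer) would only cover the size of the descriptor struct
memset((unsigned char*)cpRxBuffer.UserAddr, 0xa5, cpRxBuffer.Size);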
// start data reception:
status = GscSio4ChannelReceivePlxPhysData(BoardNum, iRxChannel, &cpRxBuffer, SetMaxBytes, &messageID);
// wait for Rx operation to complete
status = GscSio4ChannelWaitForTransfer(BoardNum, iRxChannel, 7000, messageID, &amount);
if (status)
{
// If we have an error, "amount" will contain the number of bytes that were
// actually transferred.
DisplayErrorMessage(status);
printf("\n\t%04X bytes out of %04X transferred", amount, SetMaxBytes);
}
My issue is that this code works fine and keeps up for around 5 minutes, then randomly stops being able to keep up, and the FIFO (first in, first out) register on the PCI card begins to fill up faster than the code can process the data. To me this looks like a memory leak, since the code works fine for a long time and then starts to slow down even though nothing has changed: all the code is doing is reading the data off the card. We used to save the data in a really large array, but even after removing that we had the same issue.
I am unsure how to figure out exactly what is happening, and I'm hoping for a way to determine whether there is a memory leak and how to fix it if there is.
A memory leak is only a guess, though, and the problem could very well be something else, so any out-of-the-box suggestions for diagnosing it are also appreciated.
Similar to Paul's answer, but I like to strategically place two (or more) _CrtMemCheckpoint followed by _CrtMemDifference, to cut down the noise.
Memory leaks can be detected and reported on (in Debug builds) by calling the _CrtDumpMemoryLeaks function. When running under the debugger, this will tell you (in the output tab) how many allocations you have at the time that it is called and the file and line number that each was allocated from.
Call this right at the end of your program, after you (think you) have freed all the resources you use. Anything left over is a candidate for being a leak.
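A minimal sketch of the checkpoint approach mentioned above, assuming an MSVC Debug build: it snapshots the debug heap before and after a piece of work and dumps statistics only if the two snapshots differ. The wrapper name is hypothetical. On other compilers or in Release builds the sketch simply runs the work and reports no change.

```cpp
#if defined(_MSC_VER)
#include <crtdbg.h>
#endif

// Hypothetical wrapper: runs `work` between two heap snapshots and returns
// true if the CRT debug heap reports a difference between them.
template <typename Work>
bool heap_changed_during(Work work) {
#if defined(_MSC_VER) && defined(_DEBUG)
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);
    work();
    _CrtMemCheckpoint(&after);
    if (_CrtMemDifference(&diff, &before, &after)) {
        _CrtMemDumpStatistics(&diff);  // goes to the debugger output by default
        return true;
    }
    return false;
#else
    work();        // no debug heap available: nothing to compare
    return false;
#endif
}
```

Wrapping a single iteration of the read loop in such a call narrows the report down to allocations made during that iteration, rather than everything alive at program exit.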
I am currently using shared memory with two mapped files (1.9 GB for the first and 600 MB for the second) in an application.
A process reads data from the first file, processes it, and writes the results to the second file.
I have noticed that reading from or writing to the mapped view with memcpy is sometimes very slow, for reasons I do not understand.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
GENERIC_READ | GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
NULL,
PAGE_READWRITE,
dwFileMapSizeHigh,
dwFileMapSizeLow,
NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
FILE_MAP_ALL_ACCESS,
dwOffsetHigh,
dwOffsetLow,
m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are aligned to the allocation granularity reported in the system info.
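For readers unfamiliar with that alignment step, here is a small sketch of how a 64-bit file position can be rounded down to the granularity and split into the high and low halves. The struct and function names are hypothetical, and the granularity is assumed to be a power of two (65536 is a common value).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a view offset split into the two 32-bit halves.
struct ViewOffset {
    std::uint32_t high;   // upper 32 bits of the aligned offset
    std::uint32_t low;    // lower 32 bits of the aligned offset
    std::uint64_t slack;  // bytes between the aligned view start and `position`
};

// Rounds `position` down to a multiple of `granularity` (assumed power of two).
inline ViewOffset align_view_offset(std::uint64_t position, std::uint64_t granularity) {
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    const std::uint64_t aligned = position & ~(granularity - 1);
    return ViewOffset{
        static_cast<std::uint32_t>(aligned >> 32),
        static_cast<std::uint32_t>(aligned & 0xFFFFFFFFu),
        position - aligned};
}
```

The slack value is what has to be added to the view's base pointer to reach the requested position inside the view.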
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with MapViewOfFile function) with a size of 10 MBytes as default size.
For memory view size, I tested 10kBytes, 100kB, 1MB, 10MB and 100MB. Statistically no difference, 80% of the time reading process is as described below (~200ms) but writing process is really slow.
Normally :
1/ Reading is done in ~200ms.
2/ Process done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example : For writing, I am using the below code
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
// Check the position of the memory view for ponderation
if (###)
MoveView(iOffset);
if (m_lpMapView)
{
memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
// uiSize = ~300 kBytes
memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
}
else
return uiANNStatus;
}
After using GetTickCount to pinpoint where the delay is, I see that the second memcpy call is always the one taking most of the time.
So far, N calls to memcpy (N = 500 in my tests) take up to 10 seconds in the worst case when using those shared memories.
I wrote a temporary program that performs the same number of memcpy calls on the same amount of data, and could not reproduce the problem.
For tests, I used the following conditions, they all show the same delay :
1/ I can see this on various computers, 32 or 64 bits from windows 7 to windows 10.
2/ Using the main thread or multi-threads (up to 8 with critical sections for synchronization purpose) for reading/writing.
3/ OS on SATA or SSD, memory mapped files of the software physically on a SATA or SSD hard-disk, and if on external hard-disk, tests were done through USB1, USB2 or USB3.
I am kindly asking you what you would think my mistake is for memcpy to go slow.
Best regards.
I found a solution that works for me but might not work for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces the mapped file to be updated.
I no longer see the "random" delay, but instead of the expected 200ms I get an average of 400ms, which is good enough for my application.
After some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flush should be used carefully.
Thanks.
I am running this code:
#include <iostream>
#include <cstddef>
#include <cstdio> // for printf
int main(int argc, char *argv[]){
int a1=0, a2=0;
int a3,a4;
int b1=++a1;
int b2=a2++;
int*p1=&a1;
int*p2=&++a1;
size_t st;
ptrdiff_t pt;
int i=0;
while(true){
printf("i: %d",i++);
}
printf("\n\ni now is: %d\n",i);
return 0;
}
Why do I observe such a decrease in image memory (violet)?
legend:
I made this general Win32 project, not CLR.
I changed the code, so I will see when int has become finally negative. Now the while() is:
int i=0;
while(0<++i){
printf("i: %d",i++);
}
printf("\n\ni now is: %d\n",i);
It is strange: please see what happened after just 30000 iterations. Why do we see these fluctuations in the image memory? I can see now that this is probably related to VMMap itself, because it happens only if I choose "launch & trace a new process" but not when I choose "view a running process" and point to the running exe started from VS2010. Here is the screen of the process "launched & traced":
I observed also huge paging of memory, which started roughly with this decline in image (this paging nearly accelerated and quickly triggered RAM limit, that I have set to 2GB):
and here is a running process only "viewed" (run from VS2010):
so maybe some issue related to the memory management of .NET applications is at play here?
I am still waiting for my int to cross the two's complement boundary.
Well... I have to edit again: it turns out that, as previously thought, the decreasing image memory effect is also present when the process is only viewed (not launched). Below is a picture of the same process 10 minutes later (still waiting for the int to turn negative):
and here it is:
so the largest positive two's complement int on my machine is 2 147 483 647
and the smallest negative is -2 147 483 648, which is easy to verify this way:
#include <limits>
const int min_int = std::numeric_limits<int>::min();
const int max_int = std::numeric_limits<int>::max();
it gave me the same result: -2 147 483 648 and 2 147 483 647
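One caveat worth adding here: incrementing a signed int past its maximum is undefined behavior in C++, so a loop such as while(0<++i) is not guaranteed to stop at the boundary, even though it did on this machine. A sketch of finding the boundary without overflowing; the function names are made up for illustration.

```cpp
#include <limits>

// Returns true when `value + 1` would not fit in an int.
inline bool increment_would_overflow(int value) {
    return value == std::numeric_limits<int>::max();
}

// Counts up from `start` and stops at the last representable value
// instead of relying on wrap-around.
inline int count_to_limit(int start) {
    int i = start;
    while (!increment_would_overflow(i)) {
        ++i;
    }
    return i;
}
```

Checking before incrementing keeps the loop well defined on every conforming compiler and optimization level.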
back to the beginning
When I comment out everything but the while() loop, the same thing happens: the image memory decreases after the process has been running for about 10 minutes, so it is not the useless code that causes this. But what does?
Working set is largely under control of the operating system. What your code does is only one factor it considers when deciding whether to grow or trim your working set. Other factors include whether your application is in the foreground, how active it is, how greedy the heap algorithm is, how much memory pressure exists because of the demands of other processes, etc. This is by design.
The drops are probably related to Windows choosing to trim your working set. Since most of the code that was originally loaded was probably just for initialization and is not involved in the loop, it was simple for the OS to reclaim image pages based on an LRU algorithm.
Notice that the working set allocated to the image size is not the only portion that was trimmed.
Good afternoon, We have implemented a C++ cKeyArray class to test whether we can use the Large File API to save physical memory. During Centos Linux testing, we found that the Linux File API was just as fast as using the heap for random access processing. Here are the numbers: for a 2,700,000 row SQL database where the KeySize for each row is 62 bytes,
cKeyArray class using LINUX File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,763,504,445 microsecs
Each BruteForce comparison requires two random accesses; therefore the mean time required for each random access = 1,763,504,445 microsecs / (2 * 197275) = 4470 microsecs
Heap , no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,708,442,690microsecs
the mean time required for each random access = 1,708,442,690 microsecs / (2 * 197275) = approximately 4330 microsecs.
On 32 bit Windows,the numbers are,
cKeyArray class using Windows File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 9243787 millisecs
the mean time for each random access is 23.4 millisec
Heap, no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 2,141,941 millisecs
the mean time required for each random access is 5.4 millisec
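The arithmetic behind these four figures is the same in each case: elapsed time divided by twice the number of comparisons. A small sketch for reproducing it (the function name is hypothetical):

```cpp
#include <cstdint>

// Mean time per random access, given that each comparison performs
// `accessesPerComparison` accesses. The result is in the same unit as `elapsed`.
inline double mean_access_time(std::uint64_t elapsed, std::uint64_t comparisons,
                               unsigned accessesPerComparison = 2) {
    return static_cast<double>(elapsed) /
           (static_cast<double>(comparisons) * accessesPerComparison);
}
```

Fed with the numbers above, this gives roughly 4470 and 4330 microseconds on Linux, and roughly 23.4 and 5.4 milliseconds on Windows, so the Windows file API path is a little over 4 times slower than the Windows heap path.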
We are wondering why the Linux cKeyArray numbers are just as good as the Linux heap numbers, while on 32 bit Windows the mean heap random access time is 4 times as fast as the cKeyArray Windows File API. Is there some way we can speed up the Windows cKeyArray File API?
Previously, we received a lot of good suggestions from Stack Overflow on using the Windows Memory Mapped File API. Based on these suggestions we have implemented a Memory Mapped File MRU caching class which functions properly.
Because we want to develop a cross-platform solution, we want to do due diligence and understand why the Linux File API is so fast. Thank you. We are posting a portion of the cKeyArray class implementation below.
#define KEYARRAY_THRESHOLD 100000000
// Use file instead of memory if requirement is above this number
cKeyArray::cKeyArray(long RecCount_,int KeySize_,int MatchCodeSize_, char* TmpFileName_) {
RecCount=RecCount_;
KeySize=KeySize_;
MatchCodeSize=MatchCodeSize_;
MemBuffer=0;
KeyBuffer=0;
MemFile=0;
MemFileName[0]='\x0';
ReturnBuffer=new char[MatchCodeSize + 1];
if (RecCount*KeySize<=KEYARRAY_THRESHOLD) {
InMemory=true;
MemBuffer=new char[RecCount*KeySize];
memset(MemBuffer,0,RecCount*KeySize);
} else {
InMemory=false;
strcpy(MemFileName,TmpFileName_);
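try {
MemFile=
new cFile(MemFileName,cFile::CreateAlways,cFile::ReadWrite);
}
catch (const cException&)
{
throw; // rethrow the original exception without copying or slicing it
}
try {
// widen before multiplying so the offset cannot overflow a 32-bit long
MemFile->SetFilePointer(
(int64_t)RecCount*KeySize,cFile::FileBegin);
}
catch (const cException&)
{
throw;
}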
if (!(MemFile->SetEndOfFile()))
throw cException(ERR_FILEOPEN,MemFileName);
KeyBuffer=new char[KeySize];
}
}
char *cKeyArray::GetKey(long Record_) {
memset(ReturnBuffer,0,MatchCodeSize + 1);
if (InMemory) {
memcpy(ReturnBuffer,MemBuffer+Record_*KeySize,MatchCodeSize);
} else {
MemFile->SetFilePointer((int64_t)Record_*KeySize,cFile::FileBegin); // widen before multiplying
MemFile->ReadFile(KeyBuffer,KeySize);
memcpy(ReturnBuffer,KeyBuffer,MatchCodeSize);
}
return ReturnBuffer;
}
uint32_t cKeyArray::GetDupeGroup(long Record_) {
uint32_t DupeGroup(0);
if (InMemory) {
memcpy((char*)&DupeGroup,
MemBuffer+Record_*KeySize + MatchCodeSize,sizeof(uint32_t));
} else {
MemFile->SetFilePointer(
(int64_t)Record_*KeySize + MatchCodeSize,cFile::FileBegin); // widen before multiplying
MemFile->ReadFile((char*)&DupeGroup,sizeof(uint32_t));
}
return DupeGroup;
}
On Linux, the OS aggressively caches file data in main memory -- so although you haven't explicitly allocated memory for the file contents, they are nevertheless stored in RAM. Here's a decent link with some more information about the page cache -- only one thing is missing from that description, which is that most Linux filesystems actually implement the standard I/O interfaces as thin wrappers around the page cache. That means that even though you haven't explicitly memory mapped the file, the system is still treating it as though it were memory mapped under-the-covers. That's why you see roughly equivalent performance with either approach.
I second the suggestion to factor the platform-specific stuff out, and use whichever approach is fastest for each platform. Be sure to benchmark -- don't ever make assumptions about performance.
Your memory-mapped solution should be as much as 10x faster than the file solution even in Linux. That is the speed I experience in my test cases.
Each file access system call takes hundreds of CPU cycles to complete. Time which your program could be using to do real work.
One explanation for why the speeds are similar could be that your memory map has not been used before. When a memory mapped page is accessed for the first time it must be assigned to a physical page of RAM and zeroed out or if it is a disk file it must be loaded from disk into RAM. All of that takes a considerable amount of time.
If you touch (read or write a value) each 4K of RAM before using it you should see a significant speed increase in the memory map.
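A minimal sketch of that warm-up step: read one byte from every page of a buffer or mapped view so the pages are faulted in before the timed work starts. The function name is hypothetical and the 4096-byte page size is an assumption; query the real page size on your platform.

```cpp
#include <cassert>
#include <cstddef>

// Reads one byte from every page of [data, data + size) so the OS faults the
// pages in. Returns the number of pages touched.
inline std::size_t touch_pages(const void* data, std::size_t size,
                               std::size_t pageSize = 4096) {
    assert(pageSize > 0);
    const volatile unsigned char* bytes =
        static_cast<const volatile unsigned char*>(data);
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < size; offset += pageSize) {
        unsigned char sink = bytes[offset];  // volatile read cannot be optimized away
        (void)sink;
        ++touched;
    }
    return touched;
}
```

Running the benchmark once with and once without this call separates the cost of first-touch page faults from the cost of the accesses themselves.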
I am writing a routine to compare two files using memory-mapped files. If the files are too big to be mapped in one go, I split them and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.
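The split described here can be sketched as a division with remainder. The struct and function names are hypothetical; the sketch uses % rather than a bit mask so that it also works when the segment size is not a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of how a file is split for mapping.
struct SegmentPlan {
    std::uint64_t fullSegments;  // number of whole segments
    std::uint64_t remainder;     // bytes left over after the whole segments
};

inline SegmentPlan plan_segments(std::uint64_t fileSize, std::uint64_t segmentSize) {
    assert(segmentSize > 0);
    return SegmentPlan{fileSize / segmentSize, fileSize % segmentSize};
}
```

For the 1049MB example with 512MB segments this yields two full segments and a 25MB remainder.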
Everything works fine except one thing: it always takes much, much longer to compare the remainder (25MB in this example), though the code logic is exactly the same. 3 observations:
it does not matter which is compared first; whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
the extra time in the remainder seems to be spent in user mode
profiling in VS2010 beta 1 shows the time is spent inside std::_Equal(), but this function is mostly (the profiler says 100%) waiting for I/O and other threads
I tried
changing the VIEW_SIZE_FACTOR to another value
replacing the lambda functor with a member function
changing the file size under test
changing the order of execution of the remainder to before/after the loop
The result was quite consistent: it takes a lot more time in the remainder part and in the User Mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of the mapping alignment (64K on my system), but I am not sure how.
Below is the complete code for the routine and the timing measured for a 3G file.
Can anyone please explain it? Thanks.
// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
bool operator()(const path_type& p1, const path_type& p2)
{
using boost::filesystem::exists;
using boost::filesystem::file_size;
try
{
if(!(exists(p1) && exists(p2))) return false;
const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
// lambda
boost::function<bool(size_t, size_t)> segment_compare =
[&](size_t seg_size, size_t offset)->bool
{
using boost::iostreams::mapped_file_source;
boost::chrono::run_timer t;
mapped_file_source mf1, mf2;
mf1.open(p1, seg_size, offset);
mf2.open(p2, seg_size, offset);
if(! (mf1.is_open() && mf2.is_open())) return false;
if(!equal (mf1.begin(), mf1.end(), mf2.begin())) return false;
return true;
};
boost::uintmax_t size = file_size(p1);
size_t round = size / segment_size;
size_t remainder = size & ( segment_size - 1 );
// compare the remainder
if(remainder > 0)
{
cout << "segment size = "
<< remainder
<< " bytes for the remaining round";
if(!segment_compare(remainder, segment_size * round)) return false;
}
//compare the main part. take much less time, even
for(size_t i = 0; i < round; ++i)
{
cout << "segment size = "
<< segment_size
<< " bytes, round #" << i;
if(!segment_compare(segment_size, segment_size * i)) return false;
}
}
catch(std::exception& e)
{
cout << e.what();
return false;
}
return true;
}
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders
I further split the remainder into body and tail (remainder = body + tail), trying each of the following:
body = N * alignment(), and tail < 1 * alignment()
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is a power of 2.
body = N * alignment(), and tail = remainder - body. N is random.
The total time remains unchanged, but I can see that the time does not necessarily relate to the tail, but to the sizes of body and tail: the bigger part takes more time. The time is USER TIME, which is the most incomprehensible part to me.
I also looked at the page faults through Procexp.exe. The remainder does NOT take more faults than the main loop.
Update 2
I've performed some tests on other workstations, and it seems the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
//boost::chrono::run_timer t;
cout << "Remainder size = "
<< remainder
<< " bytes \n";
size_t tail = (alignment_size - 1) & remainder;
size_t body = remainder - tail;
{
boost::chrono::run_timer t;
cout << "Remainder_tail size = " << tail << " bytes";
if(!segment_compare(tail, segment_size * round + body)) return false;
}
{
boost::chrono::run_timer t;
cout << "Remainder_body size = " << body << " bytes";
if(!segment_compare(body, segment_size * round)) return false;
}
}
Observation:
On another 2 PCs with the same h/w configuration as mine, the result is consistent, as follows:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation yielding incomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation yielding "correct" timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanation or theory? Thanks.
This behavior looks quite illogical. I wonder what would happen if we tried something stupid. Provided the overall file is larger than 512MB you could compare again a full 512MB for the last part instead of the remaining size.
something like:
if(remainder > 0)
{
cout << "segment size = "
<< remainder
<< " bytes for the remaining round";
size_t block_size, offset;
if (size > segment_size){
// re-compare a full segment that ends exactly at the end of the file
block_size = segment_size;
offset = size - segment_size;
}
else{
block_size = remainder;
offset = segment_size * round;
}
if(!segment_compare(block_size, offset)) return false;
}
It seems a really dumb thing to do because we would be comparing part of the file two times but if your profiling figures are accurate it should be faster.
It won't give us an answer (yet) but if it is indeed faster it means the response we are looking for lies in what your program does for small blocks of data.
How fragmented is the file you are comparing with? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges to account for the performance you have measured.
I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment() and handling that last little bit specially.
Also, you say you're doing 512MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?
I would recommend, to be more portable and cause less confusion, that you simply use the size as given in the template parameter and simply require that it be an even multiple of mapped_file_source::alignment() in your code. Or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter then be multiplied by some strange implementation defined constant seems a little odd.
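A sketch of the check suggested above, with a hypothetical function name: take the segment size directly in bytes and require it to be a whole multiple of the alignment. On the arithmetic question, (8<<10) is 8192, and 8192 * 65536 = 536870912 bytes, which matches the 512MB segment size printed in the question's output.

```cpp
#include <cassert>
#include <cstdint>

// True when `segmentSize` is a non-zero whole multiple of `alignment`.
inline bool is_valid_segment_size(std::uint64_t segmentSize, std::uint64_t alignment) {
    assert(alignment > 0);
    return segmentSize != 0 && segmentSize % alignment == 0;
}
```

A caller could assert this once up front and then pass the byte count straight to the mapping call, with no hidden multiplication by an implementation-defined constant.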
I know this isn't an exact answer to your question; but have you tried side-stepping the entire problem - i.e. just map the entire file in one go?
I know little about Win32 memory management; but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire filesize. Considering you are just reading from both files the OS should be able to throw away pages at any time if it gets short of RAM...
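A rough sketch of that idea for Linux, under the assumption that the process has enough address space to map both files in full (in practice a 64-bit process). The function names are hypothetical, error handling is minimal, and empty or unreadable files are simply reported as not equal.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a whole file read-only; returns MAP_FAILED on error or for an empty file.
static void* map_whole_file(const char* path, std::size_t* sizeOut) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return MAP_FAILED;
    struct stat info;
    if (fstat(fd, &info) != 0 || info.st_size <= 0) {
        close(fd);
        return MAP_FAILED;
    }
    *sizeOut = static_cast<std::size_t>(info.st_size);
    void* view = mmap(nullptr, *sizeOut, PROT_READ,
                      MAP_PRIVATE | MAP_NORESERVE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return view;
}

// Compares two non-empty files by mapping each of them in a single call.
bool files_equal_mmap(const char* pathA, const char* pathB) {
    std::size_t sizeA = 0, sizeB = 0;
    void* a = map_whole_file(pathA, &sizeA);
    void* b = map_whole_file(pathB, &sizeB);
    bool equal = false;
    if (a != MAP_FAILED && b != MAP_FAILED && sizeA == sizeB) {
        equal = std::memcmp(a, b, sizeA) == 0;
    }
    if (a != MAP_FAILED) munmap(a, sizeA);
    if (b != MAP_FAILED) munmap(b, sizeB);
    return equal;
}
```

Because both mappings are read-only, the kernel can drop their pages under memory pressure and reload them from the files on demand.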
I would try it on a Linux or BSD just to see how it acts, out of curiosity.
I have a really rough guess about the problem:
I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OSes that allowed an mmap user to view filesystem-private data or data from other files in the area just past the end of the map, so being careful here is a good idea for an OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" instead of the much faster "copy data from disk to user".
Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.
Could it be that a virus scanner is causing these strange results? Have you tried without virus scanner?
Regards,
Sebastiaan