OpenCV: is OpenCL actually enabled? - c++

I'm trying to speed up the feature-finding stage of the stitching algorithm using OpenCL. I'm using the code from the example provided here: https://github.com/opencv/opencv/blob/master/samples/cpp/stitching_detailed.cpp
I read online that the only thing I have to do is change Mat to UMat, and I did.
However, I am not sure my code is actually using OpenCL.
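For reference, the change looks something like this. This is only a minimal sketch with standard OpenCV 3.x calls; the function name is illustrative, and the real stitching pipeline obviously touches many more Mats:
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <string>

// Illustrative only: decode an image into a UMat instead of a Mat so that
// later OpenCV calls can be dispatched to OpenCL through the T-API.
void loadForFeatureFinding(const std::string& path)
{
    cv::UMat img, gray;
    cv::imread(path).copyTo(img);                 // copy the decoded Mat into a UMat
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);  // may run as an OpenCL kernel
}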
First: I'm working on an Ubuntu 16.04 virtual machine using Parallels Desktop on a MacBook Pro. Therefore the only device available to OpenCL will be the CPU (no GPU). I installed the correct drivers, SDK, etc., and the CPU should work correctly; you can see the result of the "clinfo" command in the Ubuntu shell below. Working with a CPU, I do not expect a performance improvement. My plan is just to get things working in my virtual machine and then deploy the code on a real Ubuntu machine.
Second: the code shows no improvement (as expected, see above). Actually, the required time seems to be exactly the same. Is that right? I mean, I know I am still working on the CPU, but I expected to see some difference. Moreover, looking at the call graph profiled with gprof and gprof2dot, there are no differences (for those who have never heard of gprof, it is simply a code profiler that can generate a call graph showing all calls among functions: which function calls which other function, and so on). Is that possible? Should OpenCV with and without OpenCL call exactly the same functions?
How can I be sure the code is actually working with OpenCL? I read online that there were some bugs in feature finding on OpenCL, so I would like to check for myself. Moreover, obviously, I would like to keep working on and editing the code; this is just the beginning.
I'm using this code to check if OpenCL is working:
void checkOpenCL() {
    if (!cv::ocl::haveOpenCL())
    {
        cout << "OpenCL is not available..." << endl;
        //return;
    }
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_ALL))
    {
        cout << "Failed creating the context..." << endl;
        //return;
    }
    cout << context.ndevices() << " CPU devices are detected." << endl; //This bit provides an overview of the OpenCL devices you have in your computer
    for (int i = 0; i < context.ndevices(); i++)
    {
        cv::ocl::Device device = context.device(i);
        cout << "name: " << device.name() << endl;
        cout << "available: " << device.available() << endl;
        cout << "imageSupport: " << device.imageSupport() << endl;
        cout << "OpenCL_C_Version: " << device.OpenCL_C_Version() << endl;
        cout << endl;
    }
    cv::ocl::Device(context.device(0)); //Here is where you change which GPU to use (e.g. 0 or 1)
}
And it prints:
1 CPU devices are detected.
name: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
available: 1
imageSupport: 1
OpenCL_C_Version: OpenCL C 1.2
Running clinfo in the Ubuntu shell reports:
Number of platforms 1
Platform Name Intel(R) OpenCL
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 1.2 LINUX
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
Platform Extensions function suffix INTEL
Platform Name Intel(R) OpenCL
Number of devices 1
Device Name Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 1.2 (Build 25)
Driver Version 1.2.0.25
Device OpenCL C Version OpenCL C 1.2
Device Type CPU
Device Profile FULL_PROFILE
Max compute units 4
Max clock frequency 2300MHz
Device Partition (core)
Max number of sub-devices 4
Supported partition types by counts, equally, by names (Intel)
Max work item dimensions 3
Max work item sizes 8192x8192x8192
Max work group size 8192
Preferred work group size multiple 128
Preferred / native vector sizes
char 1 / 32
short 1 / 16
int 1 / 8
long 1 / 4
half 0 / 0 (n/a)
float 1 / 8
double 1 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 6103834624 (5.685GiB)
Error Correction support No
Max memory allocation 1525958656 (1.421GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 262144
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 480
Max size for 1D images from buffer 95372416 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 480
Max number of write image args 480
Local memory type Global
Local memory size 32768 (32KiB)
Max constant buffer size 131072 (128KiB)
Max number of constant args 480
Max size of kernel argument 3840 (3.75KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Local thread execution (Intel) Yes
Prefer user sync for interop No
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [INTEL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

Compile and run ufacedetect.cpp from OpenCV's T-API samples folder; it will tell you whether OpenCL is ON or OFF, displayed on the rendered window.
Here is how to run the command:
./ufacedetect --cascade=/home/user/opencv/data/haarcascades/haarcascade_frontalface_alt.xml /dev/video5 --scale=1.3
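Alternatively, here is a minimal self-check using the T-API switch. This sketch assumes OpenCV 3.x, where cv::ocl::haveOpenCL(), cv::ocl::useOpenCL() and cv::ocl::setUseOpenCL() live in opencv2/core/ocl.hpp; run the stitching code once with OpenCL forced off and once with it on, and compare the timings:
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
    std::cout << "haveOpenCL: " << cv::ocl::haveOpenCL() << std::endl;
    std::cout << "useOpenCL:  " << cv::ocl::useOpenCL()  << std::endl;

    cv::ocl::setUseOpenCL(false);  // run the stitching code once like this...
    // ...and once after setUseOpenCL(true); identical timings suggest the
    // UMat code paths are not actually reaching OpenCL kernels.
    return 0;
}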

Related

pthread multithreading on Mac OS vs Windows multithreading

I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to do intensive background tasks.
I have followed the FLTK tutorials/examples on multithreading, which use pthreads on Mac (i.e. pthread_create) and Windows threading on Windows (i.e. _beginthread).
What I have noticed is that the performance is much higher on Windows: these background threads execute 3 to 4 times faster.
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To do a basic speed comparison I compiled the following on both machines and ran it from the command line:
#include <cmath>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt = localtime(&t);
    std::stringstream s;
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "1: " << s.str() << std::endl;
    double sum = 0;
    for (int i = 0; i < 100000000; i++) {
        double ii = i * 0.123456789;
        sum = sum + sin(ii) * cos(ii);
    }
    t = time(NULL);
    tt = localtime(&t);
    s.str("");
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "2: " << s.str() << std::endl;
}
Windows takes less than a second. Mac takes 4-5 seconds. Any ideas?
On the Mac I'm compiling with g++; on Windows with Visual Studio 2013.
SECOND EDIT
if I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think that the whole thing might be down to compiler optimisation. So the question would be: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
THIRD(!) AND FINAL EDIT
I can report that I achieved comparable performance by ensuring the g++ optimisation flag -O was provided at compile time. One of those annoying things that happens so often:
A: I'm tearing my hair out over problem x.
B: Are you sure you're not doing y?
A: That works. Why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No; if I completely read the manual for every single bit of code/program I used, I would never actually get round to doing anything...
Meh.

CUDA memory error

I run high-performance calculations on multiple GPUs (two GPUs per machine); currently I am testing my code on a GeForce GTX TITAN. Recently I noticed that random memory errors occur, so I can't rely on the outcome anymore. I tried to debug and ran into things I don't understand. I'd appreciate it if someone could help me understand why the following is happening.
So, here's my GPU:
$ nvidia-smi -a
Driver Version : 331.67
GPU 0000:03:00.0
Product Name : GeForce GTX TITAN
...
VBIOS Version : 80.10.2C.00.02
FB Memory Usage
Total : 6143 MiB
Used : 14 MiB
Free : 6129 MiB
Ecc Mode
Current : N/A
Pending : N/A
My Linux machine (Ubuntu 12.04 64-bit):
$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Here's my code (basically: allocate 4 GB of memory, fill it with zeros, copy it back to the host, and check whether all values are zero; spoiler: they're not):
#include <cstdio>

#define check(e) {if (e != cudaSuccess) { \
    printf("%d: %s\n", e, cudaGetErrorString(e)); \
    return 1; }}

int main() {
    size_t num = 1024*1024*1024;       // 1 billion elements
    size_t size = num * sizeof(float); // 4 GB of memory
    float *dp;
    float *p = new float[num];
    cudaError_t e;
    e = cudaMalloc((void**)&dp, size);                   // allocate
    check(e);
    e = cudaMemset(dp, 0, size);                         // set to zero
    check(e);
    e = cudaMemcpy(p, dp, size, cudaMemcpyDeviceToHost); // copy back
    check(e);
    for (size_t i = 0; i < num; i++) {
        if (p[i] != 0) // this should never happen, amiright?
            printf("%lu %f\n", i, p[i]);
    }
    return 0;
}
I compile and run it like this:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$ nvcc test.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
$ ./a.out | head
516836128 -0.000214
516836164 -0.841684
516836328 -3272.289062
516836428 -644673853950867887966360388719607808.000000
516836692 0.000005
516850472 232680927002624.000000
516850508 909806289566040064.000000
...
$ echo $?
0
This is not what I expected: many elements are non-zero. Here are a couple of observations:
I checked with cuda-memcheck - no errors. Checked with valgrind's memcheck - no errors.
the memory allocation works as expected, nvidia-smi reports 4179MiB / 6143MiB
the same happens if I
allocate less memory (e.g. 2 GB)
compile with -arch sm_30 or -arch compute_30 (see capabilities)
go from SDK version 6.0 back to 5.5
go from a GTX Titan to a Tesla K20c (here ECC checking is enabled and all counters are zero); the behavior is the same, and I was able to test it on five different GPU cards.
allocate multiple smaller arrays on the device
the errors disappear if I test on a GTX 680
Again, the question is: why do I see those memory errors and how can I ensure that this never happens?
I also perform calculations on GPUs and we have found the same issue; we are using a GeForce GTX 660 Ti.
I have checked that the number of errors increases with the time the GPU has been working.
The problem can be solved by shutting down the computer (it doesn't help if the machine is merely rebooted), but after some time of work the problem starts again.
I have no idea why that happens. I have tried several codes to check the memory and all of them give the same result.
As far as I have checked, this problem cannot be avoided, and the only way to be sure that your results are OK is to check the memory after the calculations and to shut down the machine every so often. I know this is not a good solution, but it is the only one I have found.

Develop OpenCL application without GPU support on Laptop?

I am new to OpenCL and want to develop a portable, hardware-independent OpenCL application.
I have an ATI Radeon 7670M in my laptop, which has OpenCL 1.1 support according to the official website.
However, this GPU is not listed on the APP SDK system requirements page.
I am interested in using OpenCL to develop only for the GPU, not for the CPU.
So is there a way I can do this?
I guess the information on the site is outdated; the 7670 is OpenCL-compatible for sure. In fact, almost all cards of the 5xxx series and newer can run OpenCL.
With an OpenCL context you can choose which device to use for development (for example, CPU or GPU devices); in your case CL_DEVICE_TYPE_GPU:
cl::Context context(CL_DEVICE_TYPE_GPU, cprops, NULL, NULL, &err);
For example, from official AMD documentation:
int main(void)
{
    cl_int err;
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
    checkErr(platformList.size() != 0 ? CL_SUCCESS : -1, "cl::Platform::get");
    std::cerr << "Platform number is: " << platformList.size() << std::endl;
    std::string platformVendor;
    platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor);
    std::cerr << "Platform is by: " << platformVendor << "\n";
    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
    cl::Context context(CL_DEVICE_TYPE_GPU, cprops, NULL, NULL, &err);
    checkErr(err, "Context::Context()");
}
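A small follow-up sketch (assuming the same cl.hpp C++ wrapper with its default std::vector/std::string configuration) that lists which GPU devices ended up in a context created with CL_DEVICE_TYPE_GPU:
#include <CL/cl.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

// Print the devices contained in an already-created OpenCL context.
void listContextDevices(const cl::Context& context)
{
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    for (std::size_t i = 0; i < devices.size(); ++i)
        std::cerr << "Device " << i << ": "
                  << devices[i].getInfo<CL_DEVICE_NAME>() << std::endl;
}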

Speed difference between AES/CBC encryption and decryption?

I am wondering, theoretically, how much slower AES/CBC decryption would be compared to AES/CBC encryption under the following conditions:
An encryption key of 32 bytes (256 bits);
A block size of 16 bytes (128 bits).
The reason I ask is that I want to know whether the decryption speed of an implementation that I have is abnormally slow. I have done some tests on random memory blocks of different sizes. The results are as follows:
[Throughput results for 64 B, 64 KB, and 10 MB – 520 MB buffers are not reproduced here.]
All data was stored on the internal memory of my system. The application generates the data to encrypt by itself. Virtual memory is disabled on the test PC so that there would not be any I/O calls.
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Update:
This test is executed on another pc;
This test is executed with random data;
Crypto++ is used for the AES/CBC encryption and decryption.
The decryption implementation is as follows:
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
CryptoPP::ArraySink * decSink = new CryptoPP::ArraySink(data, dataSizeMax);
CryptoPP::StreamTransformationFilter stfDecryptor(cbcDecryption, decSink);
stfDecryptor.Put(reinterpret_cast<const unsigned char*>(ciphertext), cipherSize);
stfDecryptor.MessageEnd();
*dataOutputSize = decSink->TotalPutLength();
Update 2:
Added result for 64 byte blocks
As a symmetric cipher, AES should be fairly close in speed for encryption and decryption. I'm not sure about your implementation, but there are ways to optimize if you're concerned about how the algorithm is being used. In experiments, AES is not the fastest cipher, and CBC adds security but slows it down. Here's a comparison, since you're asking about key and block sizes: [comparison chart not reproduced here]
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Three or four things jump out at me. I kind of agree with @JamesKPolk: the numbers look off. First, crypto libraries are usually benchmarked with CTR mode, not CBC mode. Also see the SUPERCOP benchmarks. And you have to use cycles-per-byte (cpb) to normalize measurement units across machines. Saying "9 MB/s" without any context means nothing.
Second, we need to know the machine and its CPU frequency. It looks like you are pushing data at 9 MB/s for encryption and 6.5 MB/s for decryption. A modern Intel Core machine, like a Core i5 running at 2.7 GHz, will push CBC-mode data at around 2.5 or 3.0 cpb, which is roughly 980 MB/s to 1 GB/s. Even my old Core2 Duo running at 2.0 GHz moves data faster than you are showing: the Core 2 moves data at 14.5 cpb, or about 130 MB/s.
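To make the cpb normalization concrete, here is the conversion as a tiny sketch; the 2.7 GHz figure is only the Core i5 example from above, not the asker's actual clock:
#include <cstdio>

int main()
{
    const double cpuHz       = 2.7e9;              // assumed clock, see note above
    const double bytesPerSec = 9.0 * 1024 * 1024;  // the reported 9 MB/s encryption rate
    std::printf("%.0f cycles per byte\n", cpuHz / bytesPerSec);  // roughly 286 cpb
    return 0;
}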
Third, scrap this code. There is a lot of room for improvement, so it is not worth critiquing in this context; the recommended code is below. Worth mentioning: you are creating a lot of objects, like the ArraySink and StreamTransformationFilter. The filter adds padding, which perturbs the AES encryption and decryption benchmarks and skews the results. You only need an encryption (or decryption) object, and then you only need to call ProcessBlock or ProcessString.
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
...
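A minimal sketch of that simplification (Crypto++; the key and IV here are throwaway values, and the buffer length must be a multiple of the AES block size because there is no padding at this layer):
#include <cryptopp/aes.h>
#include <cryptopp/modes.h>
#include <cryptopp/osrng.h>
#include <cryptopp/secblock.h>

int main()
{
    using namespace CryptoPP;

    AutoSeededRandomPool prng;
    SecByteBlock key(AES::DEFAULT_KEYLENGTH), iv(AES::BLOCKSIZE);
    prng.GenerateBlock(key, key.size());
    prng.GenerateBlock(iv, iv.size());

    SecByteBlock buf(4096);                // block-size-aligned work buffer
    prng.GenerateBlock(buf, buf.size());

    CBC_Mode<AES>::Decryption dec;
    dec.SetKeyWithIV(key, key.size(), iv);
    dec.ProcessString(buf, buf.size());    // in place, no filters, no padding
    return 0;
}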
Fourth, the Crypto++ wiki has a Benchmarks article with the following code. It's a new section and was not available when you asked your question. Here's how to run your test:
AutoSeededRandomPool prng;
SecByteBlock key(16);
prng.GenerateBlock(key, key.size());
CTR_Mode<AES>::Encryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);
const int BUF_SIZE = RoundUpToMultipleOf(2048U,
dynamic_cast<StreamTransformation&>(cipher).OptimalBlockSize());
AlignedSecByteBlock buf(BUF_SIZE);
prng.GenerateBlock(buf, buf.size());
const double runTimeInSeconds = 3.0;
const double cpuFreq = 2.7 * 1000 * 1000 * 1000;
double elapsedTimeInSeconds;
unsigned long i=0, blocks=1;
ThreadUserTimer timer;
timer.StartTimer();
do
{
    blocks *= 2;
    for (; i < blocks; i++)
        cipher.ProcessString(buf, BUF_SIZE);
    elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);
const double bytes = static_cast<double>(BUF_SIZE) * blocks;
const double ghz = cpuFreq / 1000 / 1000 / 1000;
const double mbs = bytes / 1024 / 1024 / elapsedTimeInSeconds;
const double cpb = elapsedTimeInSeconds * cpuFreq / bytes;
std::cout << cipher.AlgorithmName() << " benchmarks..." << std::endl;
std::cout << " " << ghz << " GHz cpu frequency" << std::endl;
std::cout << " " << cpb << " cycles per byte (cpb)" << std::endl;
std::cout << " " << mbs << " MiB per second (MiB)" << std::endl;
Running the code on a Core-i5 6400 at 2.7 GHz results in:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.58228 cycles per byte (cpb)
4422.13 MiB per second (MiB)
Fifth, when I modify the benchmark program shown above to operate on 64-byte blocks:
const int BUF_SIZE = 64;
unsigned int blocks = 0;
...
do
{
    blocks++;
    cipher.ProcessString(buf, BUF_SIZE);
    elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);
I see 3.4 cpb or 760 MB/s for the Core-i5 6400 at 2.7 GHz for 64-byte blocks. The library takes a hit for small buffers, but most (all?) libraries do.
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
3.39823 cycles per byte (cpb)
757.723 MiB per second (MiB)
Sixth, you need to get the processor out of powersave mode or a low-energy state for the best and most consistent results. On Linux the library uses governor.sh to do it; the script is available in the TestScript/ directory.
Seventh, when I switch to CTR mode decryption with:
CTR_Mode<AES>::Decryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);
Then I see about the same rate for bulk decryption:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.579923 cycles per byte (cpb)
4440.11 MiB per second (MiB)
Eighth, here is a collection of benchmark numbers for a bunch of different machines. It should provide a rough target as you tune your tests.
My BeagleBone dev board running at 980 MHz moves data at twice the rate you are reporting. The BeagleBone achieves a boring 40 cpb, about 20 MB/s, because it is straight C/C++ and not optimized for A-32.
Skylake Core-i5 @ 2.7 GHz
https://www.cryptopp.com/benchmarks.html
Core2 Duo @ 2.0 GHz
https://www.cryptopp.com/benchmarks-core2.html
LeMaker HiKey Kirin SoC AArch64 @ 1.2 GHz
https://www.cryptopp.com/benchmarks-hikey.html
AMD Opteron AArch64 @ 2.0 GHz
https://www.cryptopp.com/benchmarks-opteron.html
BananaPi Cortex-A7 dev board @ 860 MHz
https://www.cryptopp.com/benchmarks-bananapi.html
IBM Power8 Server @ 4.1 GHz
https://www.cryptopp.com/benchmarks-power8.html
My takeaways are:
CTR mode bulk encryption and decryption are about the same on modern machines
CTR mode key setup is not the same; decryption takes a little longer on modern machines
Small block sizes are more expensive than large block sizes
All of this is in line with what I expect to see.
I think your next step is to collect some data using the sample program on the Crypto++ wiki and then evaluate the results.
Theoretically, AES decryption is 30% slower. This is a property of Rijndael systems in general.
Source: http://www4.ncsu.edu/~hartwig/Teaching/437/aes.pdf

incomprehensible time consumed in using memory mapped file

I am writing a routine to compare two files using memory-mapped files, for the case where the files are too big to be mapped in one go. I split the files and map them part by part. For example, to map a 1049 MB file, I split it into 512 MB + 512 MB + 25 MB.
Everything works fine except one thing: it always takes much, much longer to compare the remainder (25 MB in this example), even though the code logic is exactly the same. Three observations:
it does not matter which part is compared first; whether the main part (512 MB * N) or the remainder (25 MB in this example) comes first, the result remains the same
the extra time for the remainder seems to be spent in user mode
profiling in VS2010 beta 1 shows the time is spent inside std::_Equal(), but this function is mostly (the profiler says 100%) waiting for I/O and other threads
I tried
changing the VIEW_SIZE_FACTOR to another value
replacing the lambda functor with a member function
changing the file size under test
changing the order of execution of the remainder to before/after the loop
The result was quite consistent: it takes a lot more time in the remainder part, and in user mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of the mapping alignment (64K on my system), but I am not sure how.
Below is the complete code for the routine and the timing measured for a 3 GB file.
Can anyone please explain it? Thanks.
// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;
        try
        {
            if(!(exists(p1) && exists(p2))) return false;
            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
            // lambda
            boost::function<bool(size_t, size_t)> segment_compare =
                [&](size_t seg_size, size_t offset)->bool
            {
                using boost::iostreams::mapped_file_source;
                boost::chrono::run_timer t;
                mapped_file_source mf1, mf2;
                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);
                if(!(mf1.is_open() && mf2.is_open())) return false;
                if(!equal(mf1.begin(), mf1.end(), mf2.begin())) return false;
                return true;
            };
            boost::uintmax_t size = file_size(p1);
            size_t round = size / segment_size;
            size_t remainder = size & (segment_size - 1);
            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = "
                     << remainder
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;
            }
            // compare the main part; takes much less time
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = "
                     << segment_size
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i)) return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }
        return true;
    }
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders
I further split the remainder into body and tail (remainder = body + tail), where:
body = N * alignment(), and tail < 1 * alignment()
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is a power of 2
body = N * alignment(), and tail = remainder - body, where N is random
The total time remains unchanged, but I can see that the time does not necessarily relate to the tail, but to the sizes of body and tail: the bigger part takes more time. The time is USER TIME, which is the part most incomprehensible to me.
I also looked at the page faults through Procexp.exe; the remainder does NOT take more faults than the main loop.
Update 2
I've performed some tests on other workstations, and it seems the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
    //boost::chrono::run_timer t;
    cout << "Remainder size = "
         << remainder
         << " bytes \n";
    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;
    {
        boost::chrono::run_timer t;
        cout << "Remainder_tail size = " << tail << " bytes";
        if(!segment_compare(tail, segment_size * round + body)) return false;
    }
    {
        boost::chrono::run_timer t;
        cout << "Remainder_body size = " << body << " bytes";
        if(!segment_compare(body, segment_size * round)) return false;
    }
}
Observation:
On another two PCs with the same hardware configuration as mine, the result is consistent, as follows:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation, which yields the incomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation, which yields the "correct" timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanation or theory? Thanks.
This behavior looks quite illogical. I wonder what would happen if we tried something stupid: provided the overall file is larger than 512 MB, you could compare a full 512 MB again for the last part instead of just the remaining size.
Something like:
if(remainder > 0)
{
    cout << "segment size = "
         << remainder
         << " bytes for the remaining round";
    size_t block_size, offset;
    if (size > segment_size)
    {
        block_size = segment_size;
        offset = size - segment_size;
    }
    else
    {
        block_size = remainder;
        offset = segment_size * round;
    }
    if(!segment_compare(block_size, offset)) return false;
}
It seems a really dumb thing to do, because we would be comparing part of the file twice, but if your profiling figures are accurate it should be faster.
It won't give us an answer (yet), but if it is indeed faster it means the answer we are looking for lies in what your program does for small blocks of data.
How fragmented is the file you are comparing? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges, which would account for the performance you have measured.
I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment(), and handle that last little bit specially.
Also, you say you're doing 512 MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?
I would recommend, to be more portable and cause less confusion, that you simply use the size as given in the template parameter and require that it be an even multiple of mapped_file_source::alignment(), or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter and then multiplied by some strange implementation-defined constant seems a little odd.
I know this isn't an exact answer to your question, but have you tried side-stepping the entire problem, i.e. just mapping the entire file in one go?
I know little about Win32 memory management, but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire file size. Considering you are just reading from both files, the OS should be able to throw away pages at any time if it gets short of RAM...
I would try it on Linux or BSD just to see how it acts, out of curiosity.
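A sketch of that Linux-only idea using the plain POSIX calls (error handling kept minimal; this bypasses boost::iostreams entirely):
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map an entire read-only file in one go without reserving swap for it.
const void* map_whole_file(const char* path, std::size_t size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    void* p = mmap(0, size, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd, 0);
    close(fd);  // the mapping stays valid after close()
    return p == MAP_FAILED ? 0 : p;
}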
I have a really rough guess about the problem:
I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OSes that allowed an mmap user to view filesystem-private data, or data from other files, in the area just past the end of the map, so being careful here is a good idea for an OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" path instead of the much faster "copy data from disk to user" path.
Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.
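A sketch of that idea, reusing the names from the question's routine (remainder, round, segment_size, segment_compare); alignment() is boost::iostreams::mapped_file_source::alignment(), typically 64K on Windows, and it is assumed to be a power of two:
size_t alignment = mapped_file_source::alignment();
size_t mappable  = remainder & ~(alignment - 1);  // largest aligned prefix of the remainder
size_t tail      = remainder - mappable;          // the final < 64K bytes
if (mappable > 0 && !segment_compare(mappable, segment_size * round)) return false;
// Compare the last 'tail' bytes with ordinary buffered reads (e.g. std::ifstream)
// instead of mapping them.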
Could it be that a virus scanner is causing these strange results? Have you tried without virus scanner?
Regards,
Sebastiaan