I am wondering, theoretically, how much slower would AES/CBC decryption be compared to AES/CBC encryption with the following conditions:
Encryption key of 32 bytes (256 bits);
A blocksize of 16 bytes (128 bits).
The reason that I ask is that I want to know if the decryption speed of an implementation that I have is not abnormally slow. I have done some tests on random memory blocks of different sizes. The results are as follows:
64B:
64KB:
10MB – 520MB:
All data was stored on the internal memory of my system. The application generates the data to encrypt by itself. Virtual memory is disabled on the test PC so that there would not be any I/O calls.
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Update:
This test is executed on another pc;
This test is executed with random data;
Crypto++ is used for the AES/CBC encryption and decryption.
The decryption implementation is as follows:
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
CryptoPP::ArraySink * decSink = new CryptoPP::ArraySink(data, dataSizeMax);
CryptoPP::StreamTransformationFilter stfDecryptor(cbcDecryption, decSink);
stfDecryptor.Put(reinterpret_cast<const unsigned char*>(ciphertext), cipherSize);
stfDecryptor.MessageEnd();
*dataOutputSize = decSink->TotalPutLength();
Update 2:
Added result for 64 byte blocks
As symmetric encryption, encryption and decryption should be fairly close in speed. Not sure about your implementation but there are ways to optimize if you're concerned about how the algorithm was used. In experiments, AES is not the fastest and CBC will add security but slow it down. Here's a comparison, since you're asking about key and block sizes:
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Three or four things jump out at me. I kind of agree with #JamesKPolk - the numbers look off. First, crypto libraries are usually bench-marked with CTR mode, not CBC mode. Also see the SUPERCOP benchmarks. Any you have to use cycles-per-byte (cpb) to normalize measurement units across machines. Saying "9 MB/s" without a context means nothing.
Second, we need to know the machine and its CPU frequency. It looks like you are pushing data at 9 MB/s for encryption and 6.5 MB/s for decryption. A modern iCore machine, like a Core-i5 running at 2.7 GHz, will push CBC mode data at around 2.5 or 3.0 cpb, which is roughly 980 MB/s or 1 GB/s. Even my old Core2 Duo running at 2.0 GHz moves data faster than you are showing. The Core 2 moves data at 14.5 cpb or 130 MB/s.
Third, scrap this code. There is a lot of room for improvement so it is not worth critiquing in this context; and the recommended code is below. Worth mentioning is, you are creating a lot of objects like ArraySource and StreamTransformationFilter. The filter adds padding which perturbs the AES encryption and decryption benchmarks and skews the results. You only need an encryption object, and then you only need to call ProcessBlock or ProcessString.
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
...
Fourth, the Crypto++ wiki has a Benchmarks article with the following code. Its a new section and was not available when you asked your question. Here's how to run your test.
AutoSeededRandomPool prng;
SecByteBlock key(16);
prng.GenerateBlock(key, key.size());
CTR_Mode<AES>::Encryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);
const int BUF_SIZE = RoundUpToMultipleOf(2048U,
dynamic_cast<StreamTransformation&>(cipher).OptimalBlockSize());
AlignedSecByteBlock buf(BUF_SIZE);
prng.GenerateBlock(buf, buf.size());
const double runTimeInSeconds = 3.0;
const double cpuFreq = 2.7 * 1000 * 1000 * 1000;
double elapsedTimeInSeconds;
unsigned long i=0, blocks=1;
ThreadUserTimer timer;
timer.StartTimer();
do
{
blocks *= 2;
for (; i<blocks; i++)
cipher.ProcessString(buf, BUF_SIZE);
elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);
const double bytes = static_cast<double>(BUF_SIZE) * blocks;
const double ghz = cpuFreq / 1000 / 1000 / 1000;
const double mbs = bytes / 1024 / 1024 / elapsedTimeInSeconds;
const double cpb = elapsedTimeInSeconds * cpuFreq / bytes;
std::cout << cipher.AlgorithmName() << " benchmarks..." << std::endl;
std::cout << " " << ghz << " GHz cpu frequency" << std::endl;
std::cout << " " << cpb << " cycles per byte (cpb)" << std::endl;
std::cout << " " << mbs << " MiB per second (MiB)" << std::endl;
Running the code on a Core-i5 6400 at 2.7 GHz results in:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.58228 cycles per byte (cpb)
4422.13 MiB per second (MiB)
Fifth, when I modify the benchmark program shown above to operate of 64-byte blocks:
const int BUF_SIZE = 64;
unsigned int blocks = 0;
...
do
{
blocks++;
cipher.ProcessString(buf, BUF_SIZE);
elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);
I see 3.4 cpb or 760 MB/s for the Core-i5 6400 at 2.7 GHz for 64-byte blocks. The library takes a hit for small buffers, but most (all?) libraries do.
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
3.39823 cycles per byte (cpb)
757.723 MiB per second (MiB)
Sixth, you need to get the processor out of powersave mode or a low energy state for best/most consistent results. The library uses governor.sh to do it on Linux. It is available in the TestScript/ directory.
Seventh, when I switch to CTR mode decryption with:
CTR_Mode<AES>::Decryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);
Then I see about the same rate for bulk decryption:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.579923 cycles per byte (cpb)
4440.11 MiB per second (MiB)
Eigth, here is a collection of benchmark numbers on a bunch of different machines. It should provide a rough target as you are tuning your tests.
My Beaglebone dev-board running at 980 MHz is moving data at twice the rate you are reporting. The Beaglebone achieves a boring 40 cpb with 20 MB/s because it is staright C/C++; and not optimized for A-32.
Skylake Core-i5 #2.7 GHz
https://www.cryptopp.com/benchmarks.html
Core2 Duo # 2.0 GHz
https://www.cryptopp.com/benchmarks-core2.html
LeMaker HiKey Kirik SoC Aarch64 #1.2 GHz
https://www.cryptopp.com/benchmarks-hikey.html
AMD Opteron Aarch64 #2.0 GHz
https://www.cryptopp.com/benchmarks-opteron.html
BananaPi Cortex-A7 dev-board #860MHz
https://www.cryptopp.com/benchmarks-bananapi.html
IBM Power8 Server #4.1 GHz
https://www.cryptopp.com/benchmarks-power8.html
My takeaways are:
CTR mode bulk encryption and decryption are about the same on modern machines
CTR mode key setup are not the same, decryption takes a little longer on modern machines
Small block sizes are more expensive than large block sizes
It is everything I expect to see.
I think your next step is to collect some data using the sample program on the Crypto++ wiki and then evaluate the results.
Theoretically, AES decryption is 30% slower. This is a property of Rijndael systems in general.
Source: http://www4.ncsu.edu/~hartwig/Teaching/437/aes.pdf
Related
I am running the following piece of code inside an SGX enclave:
void test_enclave_size() {
unsigned int i = 0;
const unsigned int MB = 1024 * 1024;
try {
for (; i < 10000; i++) {
char* tmp = new char[MB];
}
} catch (const std::exception &e) {
std::cout << "Crash with " << e.what() << " " << i << std::endl;
}
}
On my dev machine with the standard 128 MB EPC this throws a bad cxxrt::bad_alloc after 118 MB, which makes sense because I believe only 96 MB is guaranteed to be available for enclave programs. However, when running this code on a Standard_DC32s_V3, which has 192 GB of EPC memory, I get the exact same result. I assumed that because the EPC is advertised to be extremely large, I should be able to allocate far more than 128 MB.
I have thought of a couple of reasons why this might be happening:
While the EPC is now 192 GB in size, each process is still limited to 128 MB.
There is something in the kernel that needs to be enabled for me to take advantage of this large EPC.
I am misunderstanding what Azure is advertising.
I wanted to see if anyone has a good idea of what is happening before contact Azure support, since this might be a user error.
Edit:
It turns out my second reason was the closest. As X99 pointed out, when developing an enclave application there is a configuration file that defines several factors such as the number of thread contexts, whether debugging is enabled, and max heap/stack size. My maximum heap size in my configuration was set to about 118 MB, which explains why I started to get bad allocations past this amount. Once I increased the number, the issue went away. Size note: if you are on Linux, the drivers support paging. This means you can use as much memory as you wish if you can afford to suffer the paging overhead.
If you are using Open Enclave as your SDK, this configuration file (example) is what you should be editing. In this example, the maximum heap and stack are 1024 pages, which is about 4MB. This page may be useful to you as well!
If the machine you're running your enclave on has more than 128Mb of EPC AND will allow you to go further (because of a BIOS setting), there is one more setting you must fiddle with, in your Enclave.config.xml file:
<EnclaveConfiguration>
<ProdID>0</ProdID>
<ISVSVN>0</ISVSVN>
<StackMaxSize>0x40000</StackMaxSize>
<HeapMaxSize>0x100000</HeapMaxSize>
<TCSNum>10</TCSNum>
<TCSPolicy>1</TCSPolicy>
<!-- Recommend changing 'DisableDebug' to 1 to make the enclave undebuggable for enclave release -->
<DisableDebug>0</DisableDebug>
<MiscSelect>0</MiscSelect>
<MiscMask>0xFFFFFFFF</MiscMask>
</EnclaveConfiguration>
To be more specific, the HeapMaxSize value. See the SGX developper reference, page 58/59.
I am trying to send constant and large amount of bytes with SPI from a Linux embedded system -- am335x (Beaglebone: pocketbeagle version). Thing is, I am trying to increase its transfer rate.
Currently, I am accessing SPI through user-space with the following configuration on spidev (ioctl calls):
// SPI init -- (const char *device, int mode, int bits, int speed)
retv = spi_bus.spi_init(LCD_SPI_DEVICE,0,8,100000000);
It's slow! I haven't measured the mb/s, but even at 100MHz it's not sending the right amount.
I read somewhere that the DMA is automatically called when using the mcSPI driver. However, I am not really sure if it applies to user-space drivers such as spidev.
My question is: How can I increase the MB/s transfer data for SPI?
Things I have thought about so far:
1) Look for a kernel=space driver instead of using spidev.
2) Increase the word length.
Not sure though, what would you recommend to significantly increase the transfer rate of SPI?
Using below program I try to test how fast I can write to disk using std::ofstream.
I achieve around 300 MiB/s when writing a 1 GiB file.
However, a simple file copy using the cp command is at least twice as fast.
Is my program hitting the hardware limit or can it be made faster?
#include <chrono>
#include <iostream>
#include <fstream>
char payload[1000 * 1000]; // 1 MB
void test(int MB)
{
// Configure buffer
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != MB; ++i)
{
of.write(payload, sizeof(payload));
}
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}
int main()
{
for (auto i = 1; i <= 10; ++i)
{
test(i * 100);
}
}
Output:
Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s
Update
I changed from std::ofstream to fwrite:
#include <chrono>
#include <cstdio>
#include <iostream>
char payload[1024 * 1024]; // 1 MiB
void test(int number_of_megabytes)
{
FILE* file = fopen("test.file", "w");
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != number_of_megabytes; ++i)
{
fwrite(payload, 1, sizeof(payload), file );
}
fclose(file); // TODO: RAII
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns/1e9)/100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}
int main()
{
test(256);
test(512);
test(1024);
test(1024);
}
Which improves the speed to 668MiB/s for a 1 GiB file:
Size=256MiB Duration=0.4s Speed=2524.66MiB/s
Size=512MiB Duration=0.79s Speed=1262.41MiB/s
Size=1024MiB Duration=1.5s Speed=664.521MiB/s
Size=1024MiB Duration=1.5s Speed=668.85MiB/s
Which is just as fast as dd:
time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576
real 0m1.539s
user 0m0.001s
sys 0m0.344s
First, you're not really measuring the disk writing speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk writing speed, the data should be flushed to disk before calculating the time. Without flushing there could be a difference depending on the file size and the available memory.
There seems to be something wrong in the calculations too. You're not using the value of MB.
Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do that for payload too. (looks like you changed that from 1024 to 1000 in an edit where you added the calculations).
Do not use streams to write a (binary) buffer of data to disk, but instead write directly to the file, using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.
To copy a file: open the source file in read-only and, if possible, forward-only mode, and using fread(), fwrite():
while fread() from source to buffer
fwrite() buffer to destination file
This should give you a speed comparable to the speed of an OS file copy (you might want to test some different buffer sizes).
This might be slightly faster using memory mapping:
open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest
For large files smaller mapped views should be used.
Streams are slow
cp uses syscalls directly read(2) or mmap(2).
I'd wager that it's something clever inside either CP or the filesystem. If it's inside CP then it might be that the file that you are copying has a lot of 0s in it and cp is detecting this and writing a sparse version of your file. The man page for cp says "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." This could mean a few things but one of them is that cp could make a sparse version of your file which would require less disk write time.
If it's within your filesystem then it might be Deduplication.
As a long-shot 3rd, it might also be something within your OS or your disk firmware that is translating the read and write into some specialized instruction that doesn't require as much synchronization as your program requires (lower bus use means less latency).
You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.
You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).
The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.
You're not using any of the operating system's hints that you're doing sequential reading.
To fix this:
Change your code to have more than one outstanding read request at all times.
Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
Use the posix_fadvise() flags to help the OS disk schedule and page cache optimize.
Consider using mmap to cut down on overhead.
Use a larger buffer size per read request to cut down on overhead.
This answer has more information:
https://stackoverflow.com/a/3756466/344638
The problem is that you specify too small buffer for your fstream
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
Your app runs in the user mode. To write to disk, ofstream calls system write function that executed in kernel mode. Then write transfers data to system cache, then to HDD cache and then it will be written to the disk.
This buffer size affect number of system calls (1 call for every 32*1000 bytes). During system call OS must switch execution context from user mode to kernel mode and then back. Switching context is overhead. In Linux it is equivalent about 2500-3500 simple CPU commands. Because of that, your app spending the most CPU time in context switching.
In your second app you use
FILE* file = fopen("test.file", "w");
FILE using the bigger buffer by default, that is why it produce more efficient code. You can try to specify small buffer with setvbuf. In this case you should see the same performance degradation.
Please note in your case, the bottle neck is not HDD performance. It is context switching
I am trying to get total cpu usage in %. First I should start by saying that "top" will simply not do, as there is a delay between cpu dumps, it requires 2 dumps and several seconds, which hangs my program (I do not want to give it its own thread)
next thing what I tried is "ps" which is instant but always gives very high number in total (20+) and when I actually got my cpu to do something it stayed at about 20...
Is there any other way that I could get total cpu usage? It does not matter if it is over one second or longer periods of time... Longer periods would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, seperated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover) so to get the %cpu you need to calculate how many jiffies have elapsed over your interval, versus how many jiffies were spend doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPU/cores available to the systems.
Call the getloadavg() (or alternatively read the /proc/loadavg), take the first value, multiply it by 100 (to convert to percents), divide by number of CPU/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, usual to *NIX systems, can be more than 100% (per CPU/core) because it actually measures number of processes ready to be run by scheduler. With Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or system is overloaded. Under *NIX, optimal use of CPU loadavg would give you value ~1.0 (or 2.0 for dual system). If the value is much greater than number CPU/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
cpu-stat is a C++ project that permits to read Linux CPU counter from /proc/stat .
Get CPUData.* and CPUSnaphot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I suggest two files to starting...
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
have a look at this C++ Lib.
The information is parsed from /proc/stat. it also parses memory usage from /proc/meminfo and ethernet load from /proc/net/dev
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU # 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------
I am writing a routine to compare two files using memory-mapped file. In case files are too big to be mapped at one go. I split the files and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.
Every thing works fine except one thing: it always take much, much longer to compare the remainder (25MB in this example), though the code logic is exactly the same. 3 observations:
it does not matter which is compared first, whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
the extra time in the remainder seems to be spent in the user mode
Profiling in VS2010 beta 1 shows, the time is spent inside t std::_Equal(), but this function is mostly (profiler says 100%) waiting for I/O and other threads.
I tried
changing the VIEW_SIZE_FACTOR to another value
replacing the lambda functor with a member function
changing the file size under test
changing the order of execution of the remainder to before/after the loop
The result was quite consistent: it takes a lot more time in the remainder part and in the User Mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of mapping alignment (64K on my system), but not sure how.
Below is the complete code for the routine and a timing measured for a 3G file.
Can anyone please explain it, Thanks?
// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
bool operator()(const path_type& p1, const path_type& p2)
{
using boost::filesystem::exists;
using boost::filesystem::file_size;
try
{
if(!(exists(p1) && exists(p2))) return false;
const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
// lanmbda
boost::function<bool(size_t, size_t)> segment_compare =
[&](size_t seg_size, size_t offset)->bool
{
using boost::iostreams::mapped_file_source;
boost::chrono::run_timer t;
mapped_file_source mf1, mf2;
mf1.open(p1, seg_size, offset);
mf2.open(p2, seg_size, offset);
if(! (mf1.is_open() && mf2.is_open())) return false;
if(!equal (mf1.begin(), mf1.end(), mf2.begin())) return false;
return true;
};
boost::uintmax_t size = file_size(p1);
size_t round = size / segment_size;
size_t remainder = size & ( segment_size - 1 );
// compare the remainder
if(remainder > 0)
{
cout << "segment size = "
<< remainder
<< " bytes for the remaining round";
if(!segment_compare(remainder, segment_size * round)) return false;
}
//compare the main part. take much less time, even
for(size_t i = 0; i < round; ++i)
{
cout << "segment size = "
<< segment_size
<< " bytes, round #" << i;
if(!segment_compare(segment_size, segment_size * i)) return false;
}
}
catch(std::exception& e)
{
cout << e.what();
return false;
}
return true;
}
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders
Further split the remainder into body and tail(remainder = body + tail), where
body = N * alignment(), and tail < 1 * alignment()
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is exponents of 2.
body = N * alignment(), and tail = remainder - body. N is random.
the total time remains unchanged, but I can see that the time does not necessary relate to tail, but to size of body and tail. the bigger part takes more time. The time is USER TIME, which is most incomprehensible to me.
I also look at the pages faults through Procexp.exe. the remainder does NOT take more faults than the main loop.
Updates 2
I've performed some test on other workstations, and it seem the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
//boost::chrono::run_timer t;
cout << "Remainder size = "
<< remainder
<< " bytes \n";
size_t tail = (alignment_size - 1) & remainder;
size_t body = remainder - tail;
{
boost::chrono::run_timer t;
cout << "Remainder_tail size = " << tail << " bytes";
if(!segment_compare(tail, segment_size * round + body)) return false;
}
{
boost::chrono::run_timer t;
cout << "Remainder_body size = " << body << " bytes";
if(!segment_compare(body, segment_size * round)) return false;
}
}
Observation:
On another 2 PCs with the same h/w configurations with mine, the result is consistent as following:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation yielding imcomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation yielding "correct" timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanation theory? Thanks.
This behavior looks quite illogical. I wonder what would happen if we tried something stupid. Provided the overall file is larger than 512MB you could compare again a full 512MB for the last part instead of the remaining size.
something like:
if(remainder > 0)
{
cout << "segment size = "
<< remainder
<< " bytes for the remaining round";
if (size > segment_size){
block_size = segment_size;
offset = size - segment_size;
}
else{
block_size = remainder;
offset = segment_size * i
}
if(!segment_compare(block_size, offset)) return false;
}
It seems a really dumb thing to do because we would be comparing part of the file two times but if your profiling figures are accurate it should be faster.
It won't give us an answer (yet) but if it is indeed faster it means the response we are looking for lies in what your program does for small blocks of data.
How fragmented is the file you are comparing with? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges to account for the performance you have measured.
I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment() and handling that last little bit specially.
Also, you say you're doing 512MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?
I would recommend, to be more portable and cause less confusion, that you simply use the size as given in the template parameter and simply require that it be an even multiple of mapped_file_source::alignment() in your code. Or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter then be multiplied by some strange implementation defined constant seems a little odd.
I know this isn't an exact answer to your question; but have you tried side-stepping the entire problem - i.e. just map the entire file in one go?
I know little about Win32 memory management; but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire filesize. Considering you are just reading from both files the OS should be able to throw away pages at any time if it gets short of RAM...
I would try it on a Linux or BSD just to see how it acts, out of curiousity.
I have a really rough guess about the problem:
I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OS's that allowed a mmap user to view filesystem-private data or data from other files in the area just past the end of the map, so being careful here is a good idea for a OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" instead of the much faster "copy data from disk to user".
Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.
Could it be that a virus scanner is causing these strange results? Have you tried without virus scanner?
Regards,
Sebastiaan