incomprehensible time consumed in using memory mapped file - c++

I am writing a routine to compare two files using memory-mapped files. In case the files are too big to be mapped in one go, I split them and map them part by part. For example, to map a 1049MB file, I split it into 512MB + 512MB + 25MB.
Everything works fine except one thing: it always takes much, much longer to compare the remainder (25MB in this example), even though the code logic is exactly the same. 3 observations:
it does not matter which part is compared first; whether the main part (512MB * N) or the remainder (25MB in this example) comes first, the result remains the same
the extra time for the remainder seems to be spent in user mode
Profiling in VS2010 Beta 1 shows the time is spent inside std::_Equal(), but this function is mostly (the profiler says 100%) waiting for I/O and other threads.
I tried
changing the VIEW_SIZE_FACTOR to another value
replacing the lambda functor with a member function
changing the file size under test
changing the order of execution of the remainder to before/after the loop
The result was quite consistent: the remainder part takes a lot more time, and that time is spent in user mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of the mapping alignment (64K on my system), but I am not sure how.
Below is the complete code for the routine and the timings measured for a 3GB file.
Can anyone please explain it? Thanks.
// using memory-mapped files
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;
        using boost::iostreams::mapped_file_source; // needed here, not inside the lambda
        try
        {
            if(!(exists(p1) && exists(p2))) return false;
            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
            // lambda
            boost::function<bool(size_t, size_t)> segment_compare =
                [&](size_t seg_size, size_t offset) -> bool
            {
                boost::chrono::run_timer t;
                mapped_file_source mf1, mf2;
                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);
                if(!(mf1.is_open() && mf2.is_open())) return false;
                if(!equal(mf1.begin(), mf1.end(), mf2.begin())) return false;
                return true;
            };
            boost::uintmax_t size = file_size(p1);
            size_t round = size / segment_size;
            // the mask trick assumes segment_size is a power of two
            size_t remainder = size & (segment_size - 1);
            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = "
                     << remainder
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;
            }
            // compare the main part; it takes much less time
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = "
                     << segment_size
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i)) return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }
        return true;
    }
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders
I further split the remainder into a body and a tail (remainder = body + tail), where
body = N * alignment(), and tail < 1 * alignment()
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is a power of 2
body = N * alignment(), and tail = remainder - body, with N random
The total time remains unchanged, but I can see that the time does not necessarily relate to the tail, but to the sizes of body and tail: the bigger part takes more time. The time is USER TIME, which is the most incomprehensible part to me.
I also looked at the page faults through Procexp.exe; the remainder does NOT incur more faults than the main loop.
Update 2
I've performed some tests on other workstations, and it seems the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
    //boost::chrono::run_timer t;
    cout << "Remainder size = "
         << remainder
         << " bytes \n";
    // alignment_size: the mapping alignment (64K here)
    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;
    {
        boost::chrono::run_timer t;
        cout << "Remainder_tail size = " << tail << " bytes";
        if(!segment_compare(tail, segment_size * round + body)) return false;
    }
    {
        boost::chrono::run_timer t;
        cout << "Remainder_body size = " << body << " bytes";
        if(!segment_compare(body, segment_size * round)) return false;
    }
}
Observation:
On two other PCs with the same hardware configuration as mine, the results were consistently as follows:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation, which yields the incomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation yielding "correct" timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanation or theory? Thanks.

This behavior looks quite illogical. I wonder what would happen if we tried something stupid. Provided the overall file is larger than 512MB, you could compare a full 512MB again for the last part instead of just the remaining size.
something like:
size_t block_size, offset;
if(remainder > 0)
{
    cout << "segment size = "
         << remainder
         << " bytes for the remaining round";
    if (size > segment_size)
    {
        // re-compare a full segment that ends exactly at EOF
        // (note: mapped_file_source requires the offset to be a multiple of
        // alignment(), so size - segment_size may need rounding down)
        block_size = segment_size;
        offset = size - segment_size;
    }
    else
    {
        block_size = remainder;
        offset = segment_size * round;
    }
    if(!segment_compare(block_size, offset)) return false;
}
It seems a really dumb thing to do because we would be comparing part of the file twice, but if your profiling figures are accurate it should be faster.
It won't give us an answer (yet), but if it is indeed faster, it means the answer we are looking for lies in what your program does for small blocks of data.

How fragmented is the file you are comparing? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges, which would account for the performance you have measured.
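For reference, here is a minimal sketch of walking a file's extents with FSCTL_GET_RETRIEVAL_POINTERS (not from the original post; print_extents is a made-up helper name):
#include <windows.h>
#include <winioctl.h>
#include <cstdio>
#include <vector>

void print_extents(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) return;

    std::vector<BYTE> raw(64 * 1024);       // room for many extents per call
    RETRIEVAL_POINTERS_BUFFER* rpb = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(&raw[0]);
    STARTING_VCN_INPUT_BUFFER in = { 0 };   // start from the first cluster of the file

    for (;;)
    {
        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS,
                                  &in, sizeof(in), rpb, (DWORD)raw.size(), &bytes, NULL);
        if (!ok && GetLastError() != ERROR_MORE_DATA) break; // done or real failure

        LONGLONG vcn = rpb->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rpb->ExtentCount; ++i)
        {
            printf("VCN %lld -> LCN %lld, %lld clusters\n",
                   vcn, rpb->Extents[i].Lcn.QuadPart,
                   rpb->Extents[i].NextVcn.QuadPart - vcn);
            vcn = rpb->Extents[i].NextVcn.QuadPart;  // next extent starts here
        }
        if (ok) break;                      // all extents have been returned
        in.StartingVcn.QuadPart = vcn;      // resume where this batch ended
    }
    CloseHandle(h);
}
A heavily fragmented remainder would show up as many short extents in this output.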

I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment(), and handle that last little bit specially. Something like the sketch below.
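A rough sketch of that halving idea, reusing segment_compare, segment_size, round, and remainder from your code (untested; it assumes VIEW_SIZE_FACTOR is a power of two so every chunk stays a multiple of the alignment):
// compare the remainder in progressively smaller power-of-two chunks
size_t offset = segment_size * round;   // start of the remainder, already aligned
size_t left   = remainder;
size_t chunk  = segment_size / 2;
while (left >= alignment_size)          // alignment_size = mapped_file_source::alignment()
{
    while (chunk > left) chunk /= 2;    // largest power-of-two piece that still fits
    if (!segment_compare(chunk, offset)) return false;
    offset += chunk;
    left   -= chunk;
}
// the final sub-alignment tail, handled as the special case
if (left > 0 && !segment_compare(left, offset)) return false;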
Also, you say you're doing 512MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?
I would recommend, to be more portable and to cause less confusion, that you simply use the size as given in the template parameter and require that it be an even multiple of mapped_file_source::alignment(). Or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter that is then multiplied by some strange implementation-defined constant seems a little odd.

I know this isn't an exact answer to your question, but have you tried side-stepping the entire problem, i.e. just mapping the entire file in one go?
I know little about Win32 memory management, but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire file size. Considering you are just reading from both files, the OS should be able to throw away pages at any time if it runs short of RAM...
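For what it's worth, a minimal POSIX sketch of the whole-file approach (equal_whole_mmap is a made-up helper; it assumes both files fit into the address space and trims most error handling):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

bool equal_whole_mmap(const char* p1, const char* p2)
{
    int fd1 = open(p1, O_RDONLY);
    int fd2 = open(p2, O_RDONLY);
    bool same = false;
    if (fd1 >= 0 && fd2 >= 0)
    {
        struct stat s1, s2;
        if (fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0 && s1.st_size == s2.st_size)
        {
            if (s1.st_size == 0)
                same = true;            // two empty files are equal
            else
            {
                // MAP_NORESERVE: do not reserve swap for the whole mapping
                void* m1 = mmap(NULL, s1.st_size, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd1, 0);
                void* m2 = mmap(NULL, s2.st_size, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd2, 0);
                if (m1 != MAP_FAILED && m2 != MAP_FAILED)
                    same = memcmp(m1, m2, s1.st_size) == 0;
                if (m1 != MAP_FAILED) munmap(m1, s1.st_size);
                if (m2 != MAP_FAILED) munmap(m2, s2.st_size);
            }
        }
    }
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return same;
}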

I would try it on Linux or BSD just to see how it acts, out of curiosity.
I have a really rough guess about the problem:
I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OSes that allowed an mmap user to view filesystem-private data or data from other files in the area just past the end of the map, so being careful here is a good idea for an OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" instead of the much faster "copy data from disk to user".
Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.

Could it be that a virus scanner is causing these strange results? Have you tried without virus scanner?
Regards,
Sebastiaan

Related

cxxrt::bad_alloc despite large EPC

I am running the following piece of code inside an SGX enclave:
#include <iostream> // for std::cout

void test_enclave_size() {
    unsigned int i = 0;
    const unsigned int MB = 1024 * 1024;
    try {
        for (; i < 10000; i++) {
            char* tmp = new char[MB];   // allocate 1 MB per iteration, never freed
        }
    } catch (const std::exception &e) {
        std::cout << "Crash with " << e.what() << " " << i << std::endl;
    }
}
On my dev machine with the standard 128 MB EPC this throws a cxxrt::bad_alloc after 118 MB, which makes sense because I believe only 96 MB is guaranteed to be available to enclave programs. However, when running this code on a Standard_DC32s_V3, which has 192 GB of EPC memory, I get the exact same result. I assumed that because the EPC is advertised to be extremely large, I should be able to allocate far more than 128 MB.
I have thought of a couple of reasons why this might be happening:
While the EPC is now 192 GB in size, each process is still limited to 128 MB.
There is something in the kernel that needs to be enabled for me to take advantage of this large EPC.
I am misunderstanding what Azure is advertising.
I wanted to see if anyone has a good idea of what is happening before contacting Azure support, since this might be a user error.
Edit:
It turns out my second reason was the closest. As X99 pointed out, when developing an enclave application there is a configuration file that defines several factors, such as the number of thread contexts, whether debugging is enabled, and the max heap/stack size. The maximum heap size in my configuration was set to about 118 MB, which explains why I started to get bad allocations past this amount. Once I increased the number, the issue went away. Side note: if you are on Linux, the drivers support paging; this means you can use as much memory as you wish, if you can afford the paging overhead.
If you are using Open Enclave as your SDK, this configuration file (example) is what you should be editing. In this example, the maximum heap and stack are 1024 pages, which is about 4MB. This page may be useful to you as well!
If the machine you're running your enclave on has more than 128MB of EPC AND will allow you to go further (because of a BIOS setting), there is one more setting you must fiddle with, in your Enclave.config.xml file:
<EnclaveConfiguration>
  <ProdID>0</ProdID>
  <ISVSVN>0</ISVSVN>
  <StackMaxSize>0x40000</StackMaxSize>
  <HeapMaxSize>0x100000</HeapMaxSize>
  <TCSNum>10</TCSNum>
  <TCSPolicy>1</TCSPolicy>
  <!-- Recommend changing 'DisableDebug' to 1 to make the enclave undebuggable for enclave release -->
  <DisableDebug>0</DisableDebug>
  <MiscSelect>0</MiscSelect>
  <MiscMask>0xFFFFFFFF</MiscMask>
</EnclaveConfiguration>
To be more specific, the HeapMaxSize value. See the SGX developer reference, pages 58-59.

How to diagnose a visual studio project slowing down as time goes on?

Computer:
Processor: Intel Xeon Silver 4114 CPU @ 2.19GHz (2 processors)
RAM: 96 GB 2666 MHz: 12 x 8 GB sticks
OS: Windows 10
GPU: None
Hard drive: Samsung MZVLB512HAJQ-000H2 - 512GB M.2 PCIe NVMe
IDE:
Visual Studio 2019
I am including what I am doing in case it is relevant. I am running Visual Studio code in which I read data off a GSC PCI SIO4B Sync Card 256K. Using the API for this card (documentation: http://www.generalstandards.com/downloads/GscApi.1.6.10.1.pdf) I read 150 bytes of data at a rate of 100Hz using the code below. That data is then split into the message structure of my device. I can't give info on the message structure, but the data is combined into various words using a union and added to an integer array int Data[100];
Union Example:
union data_set {
    unsigned int integer;
    unsigned char input[2];
} word;
Example of how the data is read:
PLX_PHYSICAL_MEM cpRxBuffer;
#define TEST_BUFFER_SIZE 0x400

// allocates memory for the buffer
cpRxBuffer.Size = TEST_BUFFER_SIZE;
status = GscAllocPhysicalMemory(BoardNum, &cpRxBuffer);
status = GscMapPhysicalMemory(BoardNum, &cpRxBuffer);
memset((unsigned char*)cpRxBuffer.UserAddr, 0xa5, sizeof(cpRxBuffer));

// start data reception:
status = GscSio4ChannelReceivePlxPhysData(BoardNum, iRxChannel, &cpRxBuffer, SetMaxBytes, &messageID);

// wait for Rx operation to complete
status = GscSio4ChannelWaitForTransfer(BoardNum, iRxChannel, 7000, messageID, &amount);
if (status)
{
    // If we have an error, "bytesTransferred" will contain the number of bytes that we
    // actually transmitted.
    DisplayErrorMessage(status);
    printf("\n\t%04X bytes out of %04X transferred", amount, SetMaxBytes);
}
My issue is that this code works fine and keeps up for around 5 minutes, then randomly it stops being able to keep up and the FIFO (first in, first out) register on the PCI card begins to fill up faster than the code can process the data. To me this seems like a memory leak issue, since the code works fine for a long time and then starts to slow down when nothing has changed; all the code is doing is reading the data off the card. We used to save the data in a really large array, but even after removing that we had the same issue.
I am unsure how to figure out exactly what is happening, and I'm hoping for a way to determine if there is a memory leak and how to fix it if there is.
The memory leak is only a guess though, and it very well could be something else that is the problem, so any out-of-the-box suggestions for diagnosing the problem are also appreciated.
Similar to Paul's answer, but I like to strategically place two (or more) _CrtMemCheckpoint calls followed by _CrtMemDifference, to cut down the noise.
Memory leaks can be detected and reported on (in Debug builds) by calling the _CrtDumpMemoryLeaks function. When running under the debugger, this will tell you (in the output tab) how many allocations you have at the time that it is called and the file and line number that each was allocated from.
Call this right at the end of your program, after you (think you) have freed all the resources you use. Anything left over is a candidate for being a leak.
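A small sketch combining the checkpoint/difference approach above with a final _CrtDumpMemoryLeaks call (MSVC Debug builds only; run_suspect_code is a hypothetical stand-in for the acquisition loop):
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

void run_suspect_code()
{
    int* leak = new int[256];   // deliberately leaked so the diff has something to show
    (void)leak;
}

int main()
{
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);                     // snapshot the debug heap

    run_suspect_code();

    _CrtMemCheckpoint(&after);
    if (_CrtMemDifference(&diff, &before, &after))  // TRUE if the two states differ
        _CrtMemDumpStatistics(&diff);               // report only what changed in between

    _CrtDumpMemoryLeaks();                          // whole-program report at the end
    return 0;
}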

Speed difference between AES/CBC encryption and decryption?

I am wondering, theoretically, how much slower AES/CBC decryption would be compared to AES/CBC encryption under the following conditions:
An encryption key of 32 bytes (256 bits);
A block size of 16 bytes (128 bits).
The reason I ask is that I want to know whether the decryption speed of my implementation is abnormally slow. I have done some tests on random memory blocks of different sizes (64 B, 64 KB, and 10 MB through 520 MB).
[results tables for the 64 B, 64 KB, and 10 MB - 520 MB tests not reproduced]
All data was stored on the internal memory of my system. The application generates the data to encrypt by itself. Virtual memory is disabled on the test PC so that there would not be any I/O calls.
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Update:
This test is executed on another PC;
This test is executed with random data;
Crypto++ is used for the AES/CBC encryption and decryption.
The decryption implementation is as follows:
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
CryptoPP::ArraySink * decSink = new CryptoPP::ArraySink(data, dataSizeMax);
CryptoPP::StreamTransformationFilter stfDecryptor(cbcDecryption, decSink);
stfDecryptor.Put(reinterpret_cast<const unsigned char*>(ciphertext), cipherSize);
stfDecryptor.MessageEnd();
*dataOutputSize = decSink->TotalPutLength();
Update 2:
Added result for 64 byte blocks
As a symmetric cipher, AES should encrypt and decrypt at fairly close speeds. I'm not sure about your implementation, but there are ways to optimize if you're concerned about how the algorithm was used. In experiments, AES is not the fastest cipher, and CBC adds security but slows it down. Here's a comparison, since you're asking about key and block sizes: [comparison chart not reproduced]
When analyzing the table, does the difference between encryption and decryption imply that my implementation is abnormally slow? Have I done something wrong?
Three or four things jump out at me. I kind of agree with @JamesKPolk - the numbers look off. First, crypto libraries are usually benchmarked with CTR mode, not CBC mode. Also see the SUPERCOP benchmarks. And you have to use cycles-per-byte (cpb) to normalize measurement units across machines. Saying "9 MB/s" without context means nothing.
Second, we need to know the machine and its CPU frequency. It looks like you are pushing data at 9 MB/s for encryption and 6.5 MB/s for decryption. A modern iCore machine, like a Core-i5 running at 2.7 GHz, will push CBC mode data at around 2.5 or 3.0 cpb, which is roughly 980 MB/s or 1 GB/s. Even my old Core2 Duo running at 2.0 GHz moves data faster than you are showing. The Core 2 moves data at 14.5 cpb or 130 MB/s.
Third, scrap this code. There is a lot of room for improvement, so it is not worth critiquing in this context; the recommended code is below. Worth mentioning: you are creating a lot of objects, like the ArraySink and StreamTransformationFilter. The filter adds padding, which perturbs the AES encryption and decryption benchmarks and skews the results. You only need an encryption object, and then you only need to call ProcessBlock or ProcessString.
CryptoPP::AES::Decryption aesDecryption(aesKey, ENCRYPTION_KEY_SIZE_AES);
CryptoPP::CBC_Mode_ExternalCipher::Decryption cbcDecryption(aesDecryption, aesIv);
...
Fourth, the Crypto++ wiki has a Benchmarks article with the following code. It's a new section and was not available when you asked your question. Here's how to run your test.
AutoSeededRandomPool prng;
SecByteBlock key(16);
prng.GenerateBlock(key, key.size());

CTR_Mode<AES>::Encryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);

const int BUF_SIZE = RoundUpToMultipleOf(2048U,
    dynamic_cast<StreamTransformation&>(cipher).OptimalBlockSize());

AlignedSecByteBlock buf(BUF_SIZE);
prng.GenerateBlock(buf, buf.size());

const double runTimeInSeconds = 3.0;
const double cpuFreq = 2.7 * 1000 * 1000 * 1000;
double elapsedTimeInSeconds;
unsigned long i = 0, blocks = 1;

ThreadUserTimer timer;
timer.StartTimer();

do
{
    blocks *= 2;
    for (; i < blocks; i++)
        cipher.ProcessString(buf, BUF_SIZE);
    elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);

const double bytes = static_cast<double>(BUF_SIZE) * blocks;
const double ghz = cpuFreq / 1000 / 1000 / 1000;
const double mbs = bytes / 1024 / 1024 / elapsedTimeInSeconds;
const double cpb = elapsedTimeInSeconds * cpuFreq / bytes;

std::cout << cipher.AlgorithmName() << " benchmarks..." << std::endl;
std::cout << "  " << ghz << " GHz cpu frequency" << std::endl;
std::cout << "  " << cpb << " cycles per byte (cpb)" << std::endl;
std::cout << "  " << mbs << " MiB per second (MiB)" << std::endl;
Running the code on a Core-i5 6400 at 2.7 GHz results in:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.58228 cycles per byte (cpb)
4422.13 MiB per second (MiB)
Fifth, when I modify the benchmark program shown above to operate on 64-byte blocks:
const int BUF_SIZE = 64;
unsigned int blocks = 0;
...
do
{
    blocks++;
    cipher.ProcessString(buf, BUF_SIZE);
    elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);
I see 3.4 cpb or 760 MB/s for the Core-i5 6400 at 2.7 GHz for 64-byte blocks. The library takes a hit for small buffers, but most (all?) libraries do.
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
3.39823 cycles per byte (cpb)
757.723 MiB per second (MiB)
Sixth, you need to get the processor out of powersave mode or a low energy state for best/most consistent results. The library uses governor.sh to do it on Linux. It is available in the TestScript/ directory.
Seventh, when I switch to CTR mode decryption with:
CTR_Mode<AES>::Decryption cipher;
cipher.SetKeyWithIV(key, key.size(), key);
Then I see about the same rate for bulk decryption:
$ ./bench.exe
AES/CTR benchmarks...
2.7 GHz cpu frequency
0.579923 cycles per byte (cpb)
4440.11 MiB per second (MiB)
Eighth, here is a collection of benchmark numbers from a bunch of different machines. It should provide a rough target as you tune your tests.
My BeagleBone dev board running at 980 MHz moves data at twice the rate you are reporting. The BeagleBone achieves a boring 40 cpb at 20 MB/s because it is straight C/C++, not optimized for ARM A-32.
Skylake Core-i5 @ 2.7 GHz
https://www.cryptopp.com/benchmarks.html
Core2 Duo @ 2.0 GHz
https://www.cryptopp.com/benchmarks-core2.html
LeMaker HiKey Kirin SoC Aarch64 @ 1.2 GHz
https://www.cryptopp.com/benchmarks-hikey.html
AMD Opteron Aarch64 @ 2.0 GHz
https://www.cryptopp.com/benchmarks-opteron.html
BananaPi Cortex-A7 dev-board @ 860 MHz
https://www.cryptopp.com/benchmarks-bananapi.html
IBM Power8 Server @ 4.1 GHz
https://www.cryptopp.com/benchmarks-power8.html
My takeaways are:
CTR mode bulk encryption and decryption are about the same on modern machines
CTR mode key setup is not the same; decryption takes a little longer on modern machines
Small block sizes are more expensive than large block sizes
It is everything I expect to see.
I think your next step is to collect some data using the sample program on the Crypto++ wiki and then evaluate the results.
Theoretically, AES decryption is 30% slower. This is a property of Rijndael systems in general.
Source: http://www4.ncsu.edu/~hartwig/Teaching/437/aes.pdf

Performance Issue in Executing Shell Commands

In my application, I need to execute a large number of shell commands from C++ code. I found the program takes more than 30 seconds to execute 6000 commands; this is unacceptable! Is there any better way to execute shell commands (using C/C++ code)?
//Below functions is used to set rules for
//Linux tool --TC, and in runtime there will
//be more than 6000 rules to be set from shell
//those TC commans are like below example:
//tc qdisc del dev eth0 root
//tc qdisc add dev eth0 root handle 1:0 cbq bandwidth
// 10Mbit avpkt 1000 cell 8
//tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth
// 100Mbit rate 8000kbit weight 800kbit prio 5 allot 1514
// cell 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:0 classid 1:2 cbq bandwidth
// 100Mbit rate 800kbit weight 80kbit prio 5 allot 1514 cell
// 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:0 classid 1:3 cbq bandwidth
// 100Mbit rate 800kbit weight 80kbit prio 5 allot 1514 cell
// 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:1 classid 1:1001 cbq bandwidth
// 100Mbit rate 8000kbit weight 800kbit prio 8 allot 1514 cell
// 8 maxburst 20 avpkt 1000
//......
void CTCProxy::ApplyTCCommands(){
    FILE* OutputStream = NULL;
    // mTCCommands is a vector<string>;
    // every string in it is a TC rule
    int CmdCount = mTCCommands.size();
    for (int i = 0; i < CmdCount; i++){
        OutputStream = popen(mTCCommands[i].c_str(), "r");
        if (OutputStream){
            pclose(OutputStream);
        } else {
            printf("popen error!\n");
        }
    }
}
UPDATE
I tried putting all the shell commands into a shell script and letting the test app call this script file using system("xxx.sh"). This time it took 24 seconds to execute all 6000 shell commands, less than before, but still much more than we expected! Is there any other way to decrease the execution time to less than 10 seconds?
So, most likely (based on my experience with a similar type of thing), the majority of the time is spent starting a new process running a shell; the execution of the actual command in the shell is very short. (And 6000 in 30 seconds doesn't sound too terrible, actually.)
There are a variety of ways you could do this. I'd be tempted to try to combine it all into one shell script, rather than running individual lines. This would involve writing all the 'tc' strings to a file, and then passing that to popen().
Another thought is whether you can actually combine several commands together into one execution.
If the commands are complete and directly executable (that is, no shell is needed to execute the program), you could also do your own fork and exec. This would save creating a shell process, which then creates the actual process. Something like the sketch below.
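A minimal sketch of the fork/exec route (POSIX; run_direct is a hypothetical helper, and each command must already be split into argv-style words):
#include <sys/wait.h>
#include <unistd.h>
#include <string>
#include <vector>

// e.g. {"tc", "qdisc", "del", "dev", "eth0", "root"}
bool run_direct(const std::vector<std::string>& args)
{
    std::vector<char*> argv;
    for (const std::string& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid == 0)                        // child: replace the image, no shell involved
    {
        execvp(argv[0], argv.data());
        _exit(127);                      // only reached if exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);            // parent: reap the child
    return pid > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}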
Also, you may consider running a small number of processes in parallel, which on any modern machine will likely speed things up by the number of processor cores you have.
You can start a shell (/bin/sh) and pipe all commands to it, parsing the output; see the sketch below. Or you can create a Makefile, as this would give you more control over how the commands are executed, parallel execution, and error handling.
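A sketch of the single-shell variant (assuming the command output does not need to be parsed; pclose() waits for the shell to finish the whole batch):
#include <cstdio>
#include <string>
#include <vector>

void run_batched(const std::vector<std::string>& cmds)
{
    FILE* sh = popen("/bin/sh", "w");   // one shell process for the whole batch
    if (!sh) return;
    for (const std::string& c : cmds)
        fprintf(sh, "%s\n", c.c_str()); // the shell executes lines as they arrive
    pclose(sh);
}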

How to get total cpu usage in Linux using C++

I am trying to get total CPU usage in %. First I should start by saying that "top" simply will not do, as there is a delay between CPU dumps; it requires 2 dumps and several seconds, which hangs my program (I do not want to give it its own thread).
The next thing I tried is "ps", which is instant but always gives a very high number in total (20+), and when I actually got my CPU to do something it stayed at about 20...
Is there any other way to get total CPU usage? It does not matter if it is over one second or a longer period of time... a longer period would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with the answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, separated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover), so to get the %cpu you need to calculate how many jiffies have elapsed over your interval versus how many jiffies were spent doing work (see the sketch after the example below).
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
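Putting that calculation into code, a minimal sketch (it reads the aggregate cpu line twice, one second apart, with the field order described above):
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// work = user + nice + system; total = sum of all fields on the "cpu" line
static void read_cpu_line(long long& work, long long& total)
{
    std::ifstream f("/proc/stat");
    std::string line;
    std::getline(f, line);              // first line is the aggregate "cpu ..." line
    std::istringstream ss(line);
    std::string label;
    ss >> label;                        // skip the "cpu" label
    long long v;
    int field = 0;
    work = total = 0;
    while (ss >> v)
    {
        if (field < 3) work += v;       // the first three fields count as work
        total += v;
        ++field;
    }
}

int main()
{
    long long w1, t1, w2, t2;
    read_cpu_line(w1, t1);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    read_cpu_line(w2, t2);
    std::cout << 100.0 * (w2 - w1) / (t2 - t1) << "% cpu over the interval\n";
}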
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPUs/cores available to the system.
Call getloadavg() (or alternatively read /proc/loadavg), take the first value, multiply it by 100 (to convert to percent), divide by the number of CPUs/cores. If the value is greater than 100, truncate it to 100. Done (a sketch follows this answer).
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, as usual on *NIX systems, can be more than 100% (per CPU/core) because it actually measures the number of processes ready to be run by the scheduler. With a Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or whether the system is overloaded. Under *NIX, optimal use of the CPU gives a loadavg of ~1.0 (or 2.0 for a dual system). If the value is much greater than the number of CPUs/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
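A sketch of that recipe (getloadavg() exists on glibc and the BSDs; hardware_concurrency() stands in here for counting cores via /proc/cpuinfo):
#include <algorithm>
#include <cstdlib>      // getloadavg()
#include <iostream>
#include <thread>

int main()
{
    double load[1];
    if (getloadavg(load, 1) == 1)       // one sample: the 1-minute average
    {
        unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        double pct = std::min(100.0, load[0] * 100.0 / cores);
        std::cout << pct << "% (1-minute load average, capped at 100)\n";
    }
}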
cpu-stat is a C++ project that permits reading the Linux CPU counters from /proc/stat.
Get the CPUData.* and CPUSnapshot.* files from cpu-stat's src directory.
Quick implementation to get overall CPU usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
I suggest two files to start with...
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
Have a look at this C++ lib. The information is parsed from /proc/stat; it also parses memory usage from /proc/meminfo and ethernet load from /proc/net/dev.
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------