Slow SQLite3 SELECT query while INSERT/UPDATE/DELETE has no issue - C++

I am stuck on an issue related to a slow SQLite3 SELECT. I have searched a lot on this forum and have applied many of the suggestions, which has helped me move forward somewhat. I assume there is some fault in the way I am trying to use SQLite, or maybe in the settings I used while compiling it.
I decided to use SQLite3 in C++ after reading a lot about its performance. The data inflow is very high and the server is co-located at the exchange, so if packet processing is delayed for any reason there can be a packet drop, and a delayed packet is of no use in a High Frequency Trading environment. There can be a minimum packet flow of 5 Mbps, where each packet is at most 45 bytes. My SQLite database is set for in-memory use.
To understand my complete issue, please go through the details below.
Below are the details of the server on which I am trying to use SQLite3:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Stepping: 7
CPU MHz: 3400.160
BogoMIPS: 6603.86
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 10240K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15
RAM: 48 GB
Operating System : CentOS 7
Kernel Version : Linux version 3.10.0-123.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) )
Details of the process I am using:
I compile SQLite3 with the following command:
./configure --prefix=/usr --disable-static CFLAGS="-O3 -m64 -DSQLITE_DEFAULT_SYNCHRONOUS=0 -DSQLITE_CONFIG_SINGLETHREAD -DSQLITE_DEFAULT_AUTOMATIC_INDEX=0 -DSQLITE_DEFAULT_PAGE_SIZE=4096 -DSQLITE_DEFAULT_CACHE_SIZE=4000 -DHAVE_FDATASYNC=0"
Create Table Query :
create table 'Stream0' ( TokenNo int NOT NULL,OrderId integer NOT NULL,SIDE int NOT NULL,PRICE int NOT NULL,QTY int NOT NULL,PRIMARY KEY (OrderId));
Index On Table :
CREATE INDEX DataFilterIndex ON 'Stream0'(TokenNo , SIDE, Price,Qty);
Pragma Statement :
void SqliteManager::SetPragma()
{
    rc = sqlite3_exec(db, "PRAGMA synchronous = OFF", NULL, NULL, &zErrMsg);
    rc = sqlite3_exec(db, "PRAGMA count_changes = false", NULL, NULL, &zErrMsg);
    rc = sqlite3_exec(db, "PRAGMA journal_mode = OFF", NULL, NULL, &zErrMsg);
}
Preparing Sqlite Query :
MyString <<"insert or replace into 'Stream0' values( ?1,?2,?3,?4,?5);";
rc= sqlite3_prepare_v2(db,MyString.str().c_str(),strlen(MyString.str().c_str()),&insert_stmt,NULL);
Note: INSERT OR REPLACE is used because the incoming data for any given TokenNo may arrive with a modify tag without any prior insert tag.
MyString.str(std::string());
MyString <<"delete from 'Stream0' where OrderId = ?1;";
rc = sqlite3_prepare_v2(db,MyString.str().c_str(),strlen(MyString.str().c_str()),&delete_stmt,NULL);
Whenever data for a specific TokenNo is deleted/inserted/modified, a SELECT statement is run to publish data to the user.
Select Statement:
MyString<<"select TokenNo,Price ,sum(QTY) from 'Stream0' where TokenNo=?1 and Side=66 group by Price order by Price desc limit 5";
rc = sqlite3_prepare_v2(db,MyString.str().c_str(),strlen(MyString.str().c_str()),&select_bid_stmt,NULL);
Here Side = 66 stands for the buyers' side, and Price DESC means prices sorted in decreasing order.
There is one more query with Side = 83, which stands for the sellers' side, with Price ASC meaning prices sorted in ascending order.
If insert/modify data comes for a token, then one of the queries is run, with Side = 66 or Side = 83 depending on the side received in the incoming data packet.
If a delete packet is received, then both queries have to be run back to back.
If I run my executable with only insert/replace/delete, everything goes well, i.e. no packet drop; but the moment I start using the SELECT query, either the single one after insert/replace or both after delete, packet drop starts.
I hope I have been able to describe my whole situation. Running the SELECT query is a must for me. Please help.
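A standard first check here is to ask SQLite how it plans to execute the SELECT, to confirm it is served from the DataFilterIndex covering index rather than a scan. A minimal sketch, assuming the open db handle from the question:

#include <stdio.h>
#include <sqlite3.h>

// Sketch: print SQLite's query plan for the bid-side SELECT.
// The detail column should mention "COVERING INDEX DataFilterIndex".
void PrintQueryPlan(sqlite3 *db)
{
    const char *sql =
        "EXPLAIN QUERY PLAN "
        "SELECT TokenNo, Price, sum(QTY) FROM 'Stream0' "
        "WHERE TokenNo = ?1 AND Side = 66 "
        "GROUP BY Price ORDER BY Price DESC LIMIT 5";
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return;
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s\n", (const char *)sqlite3_column_text(stmt, 3)); // detail column
    sqlite3_finalize(stmt);
}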

It seems you are polling heavily, listening for incoming packets, but your query routine is just not fast enough.
You are using a powerful Xeon E5 processor capable of heavy-duty multi-threading, but you have configured SQLite3 in single-threaded mode. Try opening the database in multi-threaded mode. Better to keep the packet listener in one thread so that the main GUI thread remains responsive; worker threads can then perform the insert/delete/query database routines.
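A minimal sketch of what that could look like, assuming SQLite was compiled with threading support (the -DSQLITE_CONFIG_SINGLETHREAD flag in the question's build would have to go): SQLITE_OPEN_FULLMUTEX selects serialized mode, and a shared-cache URI lets several connections see one in-memory database.

#include <sqlite3.h>

// Sketch: open the in-memory database so multiple threads can use it.
sqlite3 *OpenSharedInMemoryDb()
{
    sqlite3 *db = NULL;
    int rc = sqlite3_open_v2("file::memory:?cache=shared", &db,
                             SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
                             SQLITE_OPEN_URI | SQLITE_OPEN_FULLMUTEX,
                             NULL);
    return rc == SQLITE_OK ? db : NULL;
}

Each worker thread can then open its own connection with the same URI, or share this handle, since serialized mode makes the handle safe to use from several threads.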

Related

How to diagnose a Visual Studio project slowing down as time goes on?

Computer:
Processor: Intel Xeon Silver 4114 CPU @ 2.19 GHz (2 processors)
RAM: 96 GB 2666 MHz: 12 x 8 GB sticks
OS: Windows 10
GPU: None
Hard drive: Samsung MZVLB512HAJQ-000H2 - 512GB M.2 PCIe NVMe
IDE:
Visual Studio 2019
I am including what I am doing in case it is relevant. I am running a Visual Studio project in which I read data off a GSC PCI SIO4B Sync Card 256K. Using the API for this card (documentation: http://www.generalstandards.com/downloads/GscApi.1.6.10.1.pdf) I read 150 bytes of data at a rate of 100 Hz using the code below. That data is then split into the message structure of my device. I can't give info on the message structure, but the data is then combined into the various words using a union and added to an integer array int Data[100];
Union Example:
union data_set {
    unsigned int integer;
    unsigned char input[2];
} word;
Example of how the data is read:
PLX_PHYSICAL_MEM cpRxBuffer;
#define TEST_BUFFER_SIZE 0x400

// allocate memory for the buffer
cpRxBuffer.Size = TEST_BUFFER_SIZE;
status = GscAllocPhysicalMemory(BoardNum, &cpRxBuffer);
status = GscMapPhysicalMemory(BoardNum, &cpRxBuffer);
memset((unsigned char*)cpRxBuffer.UserAddr, 0xa5, cpRxBuffer.Size); // fill the whole buffer (Size, not sizeof of the struct)

// start data reception:
status = GscSio4ChannelReceivePlxPhysData(BoardNum, iRxChannel, &cpRxBuffer, SetMaxBytes, &messageID);

// wait for Rx operation to complete
status = GscSio4ChannelWaitForTransfer(BoardNum, iRxChannel, 7000, messageID, &amount);
if (status)
{
    // On error, "amount" contains the number of bytes actually transferred.
    DisplayErrorMessage(status);
    printf("\n\t%04X bytes out of %04X transferred", amount, SetMaxBytes);
}
My issue is that this code works fine and keeps up for around 5 minutes, then randomly it stops being able to keep up and the FIFO (first in, first out) register on the PCI card begins to fill up faster than the code can process the data. To me this seems like a memory leak, since the code works fine for a long time and then starts to slow down when nothing has changed; all the code is doing is reading the data off the card. We used to save the data in a really large array, but even after removing that we had the same issue.
I am unsure how to figure out exactly what is happening, and I'm hoping for a way to determine whether there is a memory leak and how to fix it if there is.
The memory leak is only a guess, though, and it could very well be something else, so any out-of-the-box suggestions for diagnosing the problem are also appreciated.
Similar to Paul's answer, but I like to strategically place two (or more) _CrtMemCheckpoint calls followed by _CrtMemDifference, to cut down the noise.
Memory leaks can be detected and reported on (in Debug builds) by calling the _CrtDumpMemoryLeaks function. When running under the debugger, this will tell you (in the output tab) how many allocations you have at the time that it is called and the file and line number that each was allocated from.
Call this right at the end of your program, after you (think you) have freed all the resources you use. Anything left over is a candidate for being a leak.
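A minimal sketch combining both suggestions, assuming a Debug build with the MSVC CRT; where you put the checkpoints around your read loop is up to you:

// Debug-build leak diagnostics with the MSVC CRT (no effect in Release builds).
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

int main()
{
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);       // snapshot before the suspect code

    // ... run the card-reading loop for a while ...

    _CrtMemCheckpoint(&after);        // snapshot after
    if (_CrtMemDifference(&diff, &before, &after))
        _CrtMemDumpStatistics(&diff); // report allocations made in between

    _CrtDumpMemoryLeaks();            // anything still allocated at exit
    return 0;
}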

Improve UPDATE-per-second performance of SQLite?

My question comes directly from this one, although I'm only interested in UPDATE, and only that.
I have an application written in C/C++ which makes heavy use of SQLite, mostly SELECT/UPDATE, at a very frequent interval (about 20 queries every 0.5 to 1 second).
My database is not big, about 2500 records at the moment; here is the table structure:
CREATE TABLE player (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR(64) UNIQUE,
stats VARBINARY,
rules VARBINARY
);
Up to this point I did not use transactions, because I was improving the code and wanted stability rather than performance.
Then I measured my database performance by merely executing 10 UPDATE queries like the following (in a loop with different values):
// 10 times execution of this
UPDATE player SET stats = ? WHERE (name = ?)
where stats is a JSON string of exactly 150 characters and name is 5-10 characters.
Without transactions, the result is unacceptable: about 1 full second (0.096 s each).
With transactions, the time drops by a factor of 7.5: about 0.11 - 0.16 seconds (0.013 s each).
I tried deleting a large part of the database and/or re-ordering/deleting columns to see if that changes anything, but it did not. I get the above numbers even if the database contains just 100 records (tested).
I then tried playing with PRAGMA options:
PRAGMA synchronous = NORMAL
PRAGMA journal_mode = MEMORY
This gave me smaller times, but not always; more like 0.08 - 0.14 seconds.
PRAGMA synchronous = OFF
PRAGMA journal_mode = MEMORY
This finally gave me extremely small times, about 0.002 - 0.003 seconds, but I don't want to use it, since my application saves the database every second and there's a high chance of a corrupted database on OS/power failure.
My C SQLite code for queries is (comments/error handling/unrelated parts omitted):
// start transaction
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);

// query
sqlite3_stmt *statement = NULL;
int out = sqlite3_prepare_v2(db, query.c_str(), -1, &statement, NULL);

// bindings
for (size_t x = 0, sz = bindings.size(); x < sz; x++) {
    out = sqlite3_bind_text(statement, x + 1, bindings[x].text_value.c_str(),
                            bindings[x].text_value.size(), SQLITE_TRANSIENT);
    ...
}

// execute (sqlite3_step returns SQLITE_DONE for a completed UPDATE)
out = sqlite3_step(statement);
if (out != SQLITE_DONE) {
    // handle the error; finalize the statement regardless
}
if (statement != NULL) {
    sqlite3_finalize(statement);
}

// end the transaction
sqlite3_exec(db, "END TRANSACTION", NULL, NULL, NULL);
As you can see, it's a pretty typical table, the record count is small, and I'm doing a plain simple UPDATE exactly 10 times. Is there anything else I could do to decrease my UPDATE times? I'm using the latest SQLite 3.16.2.
NOTE: The timings above come directly from a single END TRANSACTION query. Queries are done within a single transaction and I'm using a prepared statement.
UPDATE:
I performed some tests with transactions enabled and disabled and various update counts, with the following settings:
VACUUM;
PRAGMA synchronous = NORMAL; -- def: FULL
PRAGMA journal_mode = WAL; -- def: DELETE
PRAGMA page_size = 4096; -- def: 1024
The results follows:
no transactions (10 updates)
0.30800 secs (0.0308 per update)
0.30200 secs
0.36200 secs
0.28600 secs
no transactions (100 updates)
2.64400 secs (0.02644 each update)
2.61200 secs
2.76400 secs
2.68700 secs
no transactions (1000 updates)
28.02800 secs (0.028 each update)
27.73700 secs
..
with transactions (10 updates)
0.12800 secs (0.0128 each update)
0.08100 secs
0.16400 secs
0.10400 secs
with transactions (100 updates)
0.088 secs (0.00088 each update)
0.091 secs
0.052 secs
0.101 secs
with transactions (1000 updates)
0.08900 secs (0.000089 each update)
0.15000 secs
0.11000 secs
0.09100 secs
My conclusion is that with transactions there is no per-query time cost to speak of. Perhaps the times get bigger with a colossal number of updates, but I'm not interested in those numbers. There's literally no time-cost difference between 10 and 1000 updates in a single transaction. However, I'm wondering if this is a hardware limit on my machine and not much can be done: it seems I cannot go below ~100 milliseconds using a single transaction over 10-1000 updates, even by using WAL.
Without transactions there's a fixed time cost of around 0.025 seconds per update.
With such small amounts of data, the time for the database operation itself is insignificant; what you're measuring is the transaction overhead (the time needed to force the write to the disk), which depends on the OS, the file system, and the hardware.
If you can live with its restrictions (mostly, no network), you can use asynchronous writes by enabling WAL mode.
You may still be limited by the time it takes to commit a transaction. In your first example each transaction took about 0.10 s to complete, which is pretty close to the transaction time for inserting 10 records. What kind of results do you get if you batch 100 or 1000 updates in a single transaction?
Also, SQLite expects around 60 transactions per second on an average hard drive, while you're only getting about 10. Could your disk performance be the issue here?
https://sqlite.org/faq.html#q19
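A minimal sketch of that batching pattern, reusing a single prepared statement for the whole transaction; the PlayerUpdate record type is illustrative:

#include <sqlite3.h>
#include <string>
#include <vector>

struct PlayerUpdate { std::string name, stats; }; // illustrative record type

void BatchUpdate(sqlite3* db, const std::vector<PlayerUpdate>& players)
{
    sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);

    sqlite3_stmt* stmt = NULL;
    sqlite3_prepare_v2(db, "UPDATE player SET stats = ?1 WHERE name = ?2",
                       -1, &stmt, NULL);

    for (const auto& p : players) {
        sqlite3_bind_text(stmt, 1, p.stats.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, p.name.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);           // returns SQLITE_DONE on success
        sqlite3_reset(stmt);          // rewind the statement for the next row
        sqlite3_clear_bindings(stmt); // optional: drop the old values
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "END TRANSACTION", NULL, NULL, NULL); // one disk sync for all rows
}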
Try adding indexes to your database:
CREATE INDEX IDXname ON player (name)

Performance Issue in Executing Shell Commands

In my application, I need to execute a large number of shell commands from C++ code. I found the program takes more than 30 seconds to execute 6000 commands; this is unacceptable! Is there any better way to execute shell commands (using C/C++ code)?
//The function below is used to set rules for the
//Linux tool TC; at runtime there will be more
//than 6000 rules to be set from the shell.
//The TC commands look like the examples below:
//tc qdisc del dev eth0 root
//tc qdisc add dev eth0 root handle 1:0 cbq bandwidth
// 10Mbit avpkt 1000 cell 8
//tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth
// 100Mbit rate 8000kbit weight 800kbit prio 5 allot 1514
// cell 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:0 classid 1:2 cbq bandwidth
// 100Mbit rate 800kbit weight 80kbit prio 5 allot 1514 cell
// 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:0 classid 1:3 cbq bandwidth
// 100Mbit rate 800kbit weight 80kbit prio 5 allot 1514 cell
// 8 maxburst 20 avpkt 1000 bounded
//tc class add dev eth0 parent 1:1 classid 1:1001 cbq bandwidth
// 100Mbit rate 8000kbit weight 800kbit prio 8 allot 1514 cell
// 8 maxburst 20 avpkt 1000
//......
void CTCProxy::ApplyTCCommands() {
    FILE* OutputStream = NULL;
    // mTCCommands is a vector<string>;
    // every string in it is a TC rule
    int CmdCount = mTCCommands.size();
    for (int i = 0; i < CmdCount; i++) {
        OutputStream = popen(mTCCommands[i].c_str(), "r");
        if (OutputStream) {
            pclose(OutputStream);
        } else {
            printf("popen error!\n");
        }
    }
}
UPDATE
I tried putting all the shell commands into a shell script and letting the test app call this script file using system("xxx.sh"). This time it takes 24 seconds to execute all 6000 shell commands, less than before, but still much more than we expected! Is there any way to decrease the execution time to less than 10 seconds?
So, most likely (based on my experience with this sort of thing), the majority of the time is spent starting a new process running a shell; the execution of the actual command in the shell is very short. (And 6000 in 30 seconds doesn't sound too terrible, actually.)
There are a variety of ways you could do this. I'd be tempted to try to combine it all into one shell script, rather than running individual lines. This would involve writing all the 'tc' strings to a file, and then passing that to popen().
Another thought: can you actually combine several strings together into one execution, perhaps?
If the commands are complete and directly executable (that is, no shell is needed to execute the program), you could also do your own fork and exec. This would save creating a shell process, which then creates the actual process.
Also, you may consider running a small number of processes in parallel, which on any modern machine will likely speed things up by the number of processor cores you have.
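A minimal sketch of that fork/exec approach; it assumes each command has already been tokenized into a NULL-terminated argv array (the tokenizing itself is omitted):

#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>

// Sketch: run one pre-tokenized command without spawning a shell.
// argv must be NULL-terminated, e.g. {"tc", "qdisc", "del", "dev", "eth0", "root", NULL}.
static int RunCommand(char* const argv[]) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {               // child: replace its image with the command
        execvp(argv[0], argv);
        perror("execvp");         // only reached if exec fails
        _exit(127);
    }
    int status = 0;
    waitpid(pid, &status, 0);     // parent: wait (or batch waits for parallelism)
    return status;
}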
You can start a shell (/bin/sh), pipe all the commands to it, and parse the output. Or you can create a Makefile, as this would give you more control over how the commands would be executed, plus parallel execution and error handling.

Realtime receiving of UDP packets with QNX RTOS

I have a source which sends UDP packets at a rate of 819.2 Hz (~1.2ms) to my QNX Neutrino machine. I want to receive and process those messages with as little delay and jitter as possible.
My first code was basically:
SetupUDPSocket();
while (true) {
recv(socket, buffer, BufferSize, MSG_WAITALL); // blocks until whole packet is received
processPacket(buffer);
}
The problem is that recv() only checks at each timer tick of the system whether there is a new packet available. The timer tick is usually 1 ms. So if I use this, I will get huge jitter, because I process a packet every 1 ms or every 2 ms. I could change the size of the timer tick, but that would affect the whole system (and other timers of other processes, etc.). And I would still have jitter, because I would certainly never exactly match the 819.2 Hz.
So I tried to use the interrupt line of the network card (5). But it seems there are also other things which cause the interrupt to fire. I used the following code:
ThreadCtl(_NTO_TCTL_IO, 0);
SIGEV_INTR_INIT(&event);
iID = InterruptAttachEvent(IRQ5, &event, _NTO_INTR_FLAGS_TRK_MSK);
while (true) {
    if (InterruptWait(0, NULL) == -1) {
        std::cerr << "errno: " << errno << std::endl;
    }
    length = recv(socket, buffer, bufferSize, 0); // non-blocking this time
    LogTimeAndLength();
    InterruptUnmask(IRQ5, iID);
}
This results in a single successful read in the beginning, followed by reads of 0-byte length with no time passing in between. It seems that after the InterruptUnmask(), the InterruptWait() does not wait at all, so there must already be a new interrupt (or the same one?!).
Is it possible to do something like this with the interrupt line of the network card? Are there any other possibilities to receive the packets at a rate of 819.2 Hz?
Some information about the network card:
'pci -vvv' outputs:
Class = Network (Ethernet)
Vendor ID = 8086h, Intel Corporation
Device ID = 107ch, 82541PI Gigabit Ethernet Controller
PCI index = 0h
Class Codes = 020000h
Revision ID = 5h
Bus number = 4
Device number = 15
Function num = 0
Status Reg = 230h
Command Reg = 17h
I/O space access enabled
Memory space access enabled
Bus Master enabled
Special Cycle operations ignored
Memory Write and Invalidate enabled
Palette Snooping disabled
Parity Error Response disabled
Data/Address stepping disabled
SERR# driver disabled
Fast back-to-back transactions to different agents disabled
Header type = 0h Single-function
BIST = 0h Build-in-self-test not supported
Latency Timer = 40h
Cache Line Size= 8h un-cacheable
PCI Mem Address = febc0000h 32bit length 131072 enabled
PCI Mem Address = feba0000h 32bit length 131072 enabled
PCI IO Address = ec00h length 64 enabled
Subsystem Vendor ID = 8086h
Subsystem ID = 1376h
PCI Expansion ROM = feb80000h length 131072 disabled
Max Lat = 0ns
Min Gnt = 255ns
PCI Int Pin = INT A
Interrupt line = 5
CPU Interrupt = 5h
Capabilities Pointer = dch
Capability ID = 1h - Power Management
Capabilities = c822h - 28002000h
Capability ID = 7h - PCI-X
Capabilities = 2h - 400000h
Device Dependent Registers:
0x040: 0000 0000 0000 0000 0000 0000 0000 0000
...
0x0d0: 0000 0000 0000 0000 0000 0000 01e4 22c8
0x0e0: 0020 0028 0700 0200 0000 4000 0000 0000
0x0f0: 0500 8000 0000 0000 0000 0000 0000 0000
and 'nicinfo' outputs:
wm1:
INTEL 82544 Gigabit (Copper) Ethernet Controller
Physical Node ID ........................... 000E0C C5F6DD
Current Physical Node ID ................... 000E0C C5F6DD
Current Operation Rate ..................... 100.00 Mb/s full-duplex
Active Interface Type ...................... MII
Active PHY address ....................... 0
Maximum Transmittable data Unit ............ 1500
Maximum Receivable data Unit ............... 0
Hardware Interrupt ......................... 0x5
Memory Aperture ............................ 0xfebc0000 - 0xfebdffff
Promiscuous Mode ........................... Off
Multicast Support .......................... Enabled
Thanks for reading!
I am not quite sure why the statement "The problem is that recv() only checks at each timer tick of the system if there is a new packet available. The timer tick is usually 1ms." would be true for a preemptive OS. There must be something in the system configuration, or the network protocol stack implementation has some issues.
Years ago, when I was working on an IPTV STB project for Yahoo BB Japan, I hit an issue with RTP receiving. The issue was not delay or jitter, but the overall system performance in the STB after we added some NDS algorithm. We were using VxWorks, and VxWorks supports an Ethernet hook interface, which is called each time an Ethernet packet is received by the driver.
I hooked an API into it and parsed the UDP packets with the specified port directly out of the Ethernet frames. Of course we assumed there was no fragmentation, which was guaranteed by the network setup for performance reasons. Maybe you can check whether you can get the same hook in the QNX Ethernet driver. At least you would find out whether the jitter comes from the driver or not.
How big are your UDP packets? If the packet size is small, you will gain greater efficiency by packing more data into a single packet and decreasing the transmission rate.
I suspect the interrupt service routine (ISR) is not masking the interrupt. Perhaps it is designed for edge sensitivity while the interrupt is level-sensitive.
Sorry I'm a bit late to the party, but I came across your question and saw that it was similar to a situation I encountered. Instead of hardware interrupts, you could try a software interrupt using signals. QNX has some documentation here: http://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysarch/microkernel.html#IPCSIGNALS . I was using CentOS at the time, but the theory is the same. According to http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/lib_ref/s/socket.html you can use ioctl() to set up a receive group for the SIGIO signal for a given file descriptor, in your case a UDP socket. When the socket has data ready for reading, a SIGIO signal is sent to the process indicated by ioctl(). Use sigaction() to tell the OS what signal-handling function to use. In your case, the signal handler can read the data off the socket and store it in a buffer for processing. Use pause() to suspend the process until it handles the SIGIO signal. When the signal handler returns, the process will wake up and you can process the data in the buffer.
That should allow you to process your data as it comes in without having to deal with timers or hardware interrupts. One thing to verify is whether your system can process those signals as fast as the UDP traffic is coming in.
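A minimal sketch of that signal-driven pattern, written with the Linux-style fcntl() calls (F_SETOWN / O_ASYNC) rather than the QNX ioctl() variant the links describe; treat porting the same flow to QNX as an assumption:

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/socket.h>

static volatile sig_atomic_t packet_ready = 0;

// A real handler would recv() into a buffer; a flag keeps this sketch async-signal-safe.
static void on_sigio(int) { packet_ready = 1; }

void SetupSignalDrivenIO(int sock)
{
    struct sigaction sa = {};
    sa.sa_handler = on_sigio;
    sigaction(SIGIO, &sa, NULL);

    fcntl(sock, F_SETOWN, getpid());                      // deliver SIGIO to this process
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_ASYNC); // enable signal-driven I/O
}

// Main loop: pause() sleeps until a signal arrives, then the socket is drained.
// while (true) { pause(); if (packet_ready) { packet_ready = 0; /* recv() until empty */ } }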

Incomprehensible time consumed using memory-mapped files

I am writing a routine to compare two files using memory-mapped files. In case the files are too big to be mapped in one go, I split them and map them part by part. For example, to map a 1049 MB file, I split it into 512 MB + 512 MB + 25 MB.
Everything works fine except one thing: it always takes much, much longer to compare the remainder (25 MB in this example), though the code logic is exactly the same. Three observations:
it does not matter which is compared first; whether the main part (512 MB * N) or the remainder (25 MB in this example) comes first, the result remains the same
the extra time on the remainder seems to be spent in user mode
profiling in VS2010 Beta 1 shows the time is spent inside std::_Equal(), but this function is mostly (the profiler says 100%) waiting for I/O and other threads
I tried
changing the VIEW_SIZE_FACTOR to another value
replacing the lambda functor with a member function
changing the file size under test
changing the order of execution of the remainder to before/after the loop
The result was quite consistent: it takes a lot more time on the remainder part, and in user mode.
I suspect it has something to do with the fact that the mapped size is not a multiple of the mapping alignment (64K on my system), but I am not sure how.
Below is the complete code for the routine, and the timing measured for a 3 GB file.
Can anyone please explain it? Thanks.
// using memory-mapped file
template <size_t VIEW_SIZE_FACTOR>
struct is_equal_by_mmapT
{
public:
    bool operator()(const path_type& p1, const path_type& p2)
    {
        using boost::filesystem::exists;
        using boost::filesystem::file_size;
        using boost::iostreams::mapped_file_source;
        try
        {
            if(!(exists(p1) && exists(p2))) return false;
            const size_t segment_size = mapped_file_source::alignment() * VIEW_SIZE_FACTOR;
            // lambda
            boost::function<bool(size_t, size_t)> segment_compare =
                [&](size_t seg_size, size_t offset)->bool
            {
                boost::chrono::run_timer t;
                mapped_file_source mf1, mf2;
                mf1.open(p1, seg_size, offset);
                mf2.open(p2, seg_size, offset);
                if(!(mf1.is_open() && mf2.is_open())) return false;
                if(!equal(mf1.begin(), mf1.end(), mf2.begin())) return false;
                return true;
            };
            boost::uintmax_t size = file_size(p1);
            size_t round = size / segment_size;
            size_t remainder = size & (segment_size - 1);
            // compare the remainder
            if(remainder > 0)
            {
                cout << "segment size = "
                     << remainder
                     << " bytes for the remaining round";
                if(!segment_compare(remainder, segment_size * round)) return false;
            }
            // compare the main part; this takes much less time
            for(size_t i = 0; i < round; ++i)
            {
                cout << "segment size = "
                     << segment_size
                     << " bytes, round #" << i;
                if(!segment_compare(segment_size, segment_size * i)) return false;
            }
        }
        catch(std::exception& e)
        {
            cout << e.what();
            return false;
        }
        return true;
    }
};
typedef is_equal_by_mmapT<(8<<10)> is_equal_by_mmap; // 8192 * 64K alignment = 512MB
output:
segment size = 354410496 bytes for the remaining round
real 116.892s, cpu 56.201s (48.1%), user 54.548s, system 1.652s
segment size = 536870912 bytes, round #0
real 72.258s, cpu 2.273s (3.1%), user 0.320s, system 1.953s
segment size = 536870912 bytes, round #1
real 75.304s, cpu 1.943s (2.6%), user 0.240s, system 1.702s
segment size = 536870912 bytes, round #2
real 84.328s, cpu 1.783s (2.1%), user 0.320s, system 1.462s
segment size = 536870912 bytes, round #3
real 73.901s, cpu 1.702s (2.3%), user 0.330s, system 1.372s
More observations after the suggestions by responders:
I further split the remainder into body and tail (remainder = body + tail), where:
body = N * alignment(), and tail < 1 * alignment()
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is even.
body = m * alignment(), and tail < 1 * alignment() + n * alignment(), where m is a power of 2.
body = N * alignment(), and tail = remainder - body, with N random.
The total time remains unchanged, but I can see that the time does not necessarily relate to the tail but to the sizes of body and tail: the bigger part takes more time. The time is USER TIME, which is the most incomprehensible part to me.
I also looked at the page faults through Procexp.exe; the remainder does NOT take more faults than the main loop.
Update 2
I've performed some tests on other workstations, and it seems the issue is related to the hardware configuration.
Test Code
// compare the remainder, alternative way
if(remainder > 0)
{
    //boost::chrono::run_timer t;
    cout << "Remainder size = "
         << remainder
         << " bytes \n";
    size_t tail = (alignment_size - 1) & remainder;
    size_t body = remainder - tail;
    {
        boost::chrono::run_timer t;
        cout << "Remainder_tail size = " << tail << " bytes";
        if(!segment_compare(tail, segment_size * round + body)) return false;
    }
    {
        boost::chrono::run_timer t;
        cout << "Remainder_body size = " << body << " bytes";
        if(!segment_compare(body, segment_size * round)) return false;
    }
}
Observation:
On two other PCs with the same hardware configuration as mine, the results were consistent, as follows:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.060s, cpu 0.040s (66.7%), user 0.000s, system 0.040s
Remainder_body size = 44826624 bytes
real 13.601s, cpu 7.731s (56.8%), user 7.481s, system 0.250s
segment size = 67108864 bytes, total round# = 19
real 172.476s, cpu 4.356s (2.5%), user 0.731s, system 3.625s
However, running the same code on a PC with a different h/w configuration yielded:
------VS2010Beta1ENU_VSTS.iso [1319909376 bytes] ------
Remainder size = 44840960 bytes
Remainder_tail size = 14336 bytes
real 0.013s, cpu 0.000s (0.0%), user 0.000s, system 0.000s
Remainder_body size = 44826624 bytes
real 2.468s, cpu 0.188s (7.6%), user 0.047s, system 0.141s
segment size = 67108864 bytes, total round# = 19
real 65.587s, cpu 4.578s (7.0%), user 0.844s, system 3.734s
System Info
My workstation, which yields the incomprehensible timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Uniprocessor Free
Original Install Date: 2004-01-27, 23:08
System Up Time: 3 Days, 2 Hours, 15 Minutes, 46 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex GX520
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 15 Model 4 Stepping 3 GenuineIntel ~2992 Mhz
BIOS Version: DELL - 7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume2
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,574 MB
Available Physical Memory: 1,986 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 1,916 MB
Virtual Memory: In Use: 132 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: No
IP address(es)
[01]: 192.168.75.1
[02]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: No
IP address(es)
[01]: 192.168.230.1
[03]: Broadcom NetXtreme Gigabit Ethernet
Connection Name: Local Area Connection 4
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.8.154
Another workstation yielding "correct" timing:
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
OS Configuration: Member Workstation
OS Build Type: Multiprocessor Free
Original Install Date: 5/18/2009, 2:28:18 PM
System Up Time: 21 Days, 5 Hours, 0 Minutes, 49 Seconds
System Manufacturer: Dell Inc.
System Model: OptiPlex 755
System type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x86 Family 6 Model 15 Stepping 13 GenuineIntel ~2194 Mhz
BIOS Version: DELL - 15
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
System Locale: zh-cn;Chinese (China)
Input Locale: en-us;English (United States)
Time Zone: (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory: 3,317 MB
Available Physical Memory: 1,682 MB
Virtual Memory: Max Size: 2,048 MB
Virtual Memory: Available: 2,007 MB
Virtual Memory: In Use: 41 MB
Page File Location(s): C:\pagefile.sys
NetWork Card(s): 3 NIC(s) Installed.
[01]: Intel(R) 82566DM-2 Gigabit Network Connection
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 10.8.0.31
IP address(es)
[01]: 10.8.0.137
[02]: VMware Virtual Ethernet Adapter for VMnet1
Connection Name: VMware Network Adapter VMnet1
DHCP Enabled: Yes
DHCP Server: 192.168.154.254
IP address(es)
[01]: 192.168.154.1
[03]: VMware Virtual Ethernet Adapter for VMnet8
Connection Name: VMware Network Adapter VMnet8
DHCP Enabled: Yes
DHCP Server: 192.168.2.254
IP address(es)
[01]: 192.168.2.1
Any explanatory theory? Thanks.
This behavior looks quite illogical. I wonder what would happen if we tried something stupid: provided the overall file is larger than 512 MB, you could compare a full 512 MB again for the last part instead of just the remaining size.
something like:
if(remainder > 0)
{
    cout << "segment size = "
         << remainder
         << " bytes for the remaining round";
    if(size > segment_size) {
        block_size = segment_size;
        offset = size - segment_size;
    }
    else {
        block_size = remainder;
        offset = segment_size * round;
    }
    if(!segment_compare(block_size, offset)) return false;
}
It seems a really dumb thing to do, because we would be comparing part of the file twice, but if your profiling figures are accurate it should be faster.
It won't give us an answer (yet), but if it is indeed faster, it means the answer we are looking for lies in what your program does for small blocks of data.
How fragmented is the file you are comparing? You can use FSCTL_GET_RETRIEVAL_POINTERS to get the ranges that the file maps to on disk. I suspect the last 25 MB will have a lot of small ranges, which would account for the performance you have measured.
I wonder if mmap behaves strangely when a segment isn't an even number of pages in size? Maybe you can try handling the last parts of the file by progressively halving your segment sizes until you get to a size that's less than mapped_file_source::alignment(), and handling that last little bit specially.
Also, you say you're using 512 MB blocks, but your code sets the size to 8<<10. It then multiplies that by mapped_file_source::alignment(). Is mapped_file_source::alignment() really 65536?
I would recommend, to be more portable and cause less confusion, that you simply use the size as given in the template parameter and require that it be an even multiple of mapped_file_source::alignment() in your code. Or have people pass in the power of two to start at for the block size, or something. Having the block size passed in as a template parameter and then multiplied by some strange implementation-defined constant seems a little odd.
I know this isn't an exact answer to your question, but have you tried side-stepping the entire problem, i.e. just mapping the entire file in one go?
I know little about Win32 memory management, but on Linux you can use the MAP_NORESERVE flag with mmap(), so you don't need to reserve RAM for the entire file size. Considering you are just reading from both files, the OS should be able to throw away pages at any time if it gets short of RAM.
I would try it on Linux or BSD just to see how it acts, out of curiosity.
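A minimal sketch of that whole-file approach on Linux; the helper name and error handling are illustrative:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Sketch: map both files whole and compare. MAP_NORESERVE avoids reserving
// swap for the full size, as suggested above; pages are paged in on demand.
bool files_equal(const char* path1, const char* path2)
{
    int fd1 = open(path1, O_RDONLY);
    int fd2 = open(path2, O_RDONLY);
    struct stat st1 = {}, st2 = {};
    bool ok = fd1 >= 0 && fd2 >= 0 &&
              fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0 &&
              st1.st_size == st2.st_size && st1.st_size > 0;

    bool equal = false;
    if (ok) {
        void* m1 = mmap(NULL, st1.st_size, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd1, 0);
        void* m2 = mmap(NULL, st2.st_size, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd2, 0);
        equal = m1 != MAP_FAILED && m2 != MAP_FAILED &&
                memcmp(m1, m2, st1.st_size) == 0;
        if (m1 != MAP_FAILED) munmap(m1, st1.st_size);
        if (m2 != MAP_FAILED) munmap(m2, st2.st_size);
    }
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return equal;
}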
I have a really rough guess about the problem:
I bet that Windows is doing a lot of extra checks to make sure it doesn't map past the end of the file. In the past there have been security problems in some OSes that allowed an mmap user to view filesystem-private data, or data from other files, in the area just past the end of the map, so being careful here is a good idea for an OS designer. So Windows may be using a much more careful "copy data from disk to kernel, zero out unmapped data, copy data to user" path instead of the much faster "copy data from disk to user" path.
Try mapping to just under the end of the file, excluding the last bytes that don't fit into a 64K block.
Could it be that a virus scanner is causing these strange results? Have you tried without the virus scanner?
Regards,
Sebastiaan