Here's a piece of code I've spent the last 2 days optimizing and profiling because it was taking too much time:
{
    mongo::ScopedDbConnection _dbConnection(DbHost);
    _dbConnection->insert(TokensDB, tokensArray);
    _dbConnection.done();
}
{
    mongo::ScopedDbConnection _dbConnection(DbHost);
    _dbConnection->insert(IdxDB, postingsArray);
    _dbConnection.done();
}
Here postingsArray is a std::vector of BSON documents with fields (int64_t, int64_t, int64_t, int), 20,000 elements. This insert always takes only a couple of milliseconds. tokensArray is a std::vector of BSON documents with fields (int64_t, std::string), 5,000 elements. This is the odd insert.
If I do it exactly as in the code fragment above, it takes 45-50 ms. But if I switch the two blocks around as it initially was (insert to IdxDB first and TokensDB second) it takes 400-500 ms. What is going on here? Why does order matter? Why is inserting 5000 2-field records taking much longer than inserting 20k 4-field objects?
My initial idea was that the std::string field is to blame (it holds a single English word, about 5-7 characters on average). I replaced it with a random int64_t number: no noticeable change in insert completion time.
All the profiling is done on a clean database and with exactly the same data every time, so I don't believe the measurements themselves are organized incorrectly.
MongoDB performs a lot of work in the background, so it is normal that the insertion of the large postingsArray takes little time but affects the performance of what comes after it. When you measure the postingsArray insert alone, you are only measuring the time it takes for the MongoDB driver to accept the insert. But when you measure the subsequent operations, you begin to notice the background workload started by the postingsArray insert.
See point 6 there: http://article.gmane.org/gmane.comp.db.mongodb.user/818
BTW, the way your example is written, I would suspect the driver gives you the same connection for both inserts. (E.g. you might be taking a connection from the pool, inserting the postingsArray with it, releasing it, then taking the same connection from the pool again and inserting the tokensArray with it.) In that case the TCP/IP socket might still be busy with the postingsArray insert, and what you're seeing might be you hitting the limit of the TCP/IP buffer.
P.S. You might want to change the write concern in order to measure the actual time it takes for MongoDB to perform the insert: http://article.gmane.org/gmane.comp.db.mongodb.user/68288
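For instance, the timing block could be changed along these lines. This is only a sketch, assuming the legacy mongo C++ driver from the question, where getLastError() is available to make the call wait for the server's acknowledgement instead of returning as soon as the driver has accepted the message:

{
    mongo::ScopedDbConnection _dbConnection(DbHost);
    _dbConnection->insert(TokensDB, tokensArray);
    // blocks until the server acknowledges the write; empty string means success
    std::string err = _dbConnection->getLastError();
    _dbConnection.done();
}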
What do I want to do?
I have written a program which reads data from binary files and does calculations based on the values read. Execution time is most important for this program. To validate that my program is operating within the specified time limits, I tried to log all the calculations by storing them in a std::vector<std::string>. After the time-critical execution is done, I write this vector to a file.
What is stored inside the vector?
In the vector I write the execution time (std::chrono::steady_clock::now()) and the current clock time (std::chrono::system_clock::now(), together with date.h by Howard Hinnant).
What did I observe?
While analyzing the results I stumbled over the following pattern. Independent of the input data, the mean execution time of 0.003 ms per operation explodes to ~20 ms for a single operation at one specific, reproducible index. After this, the execution time of all operations goes back to 0.003 ms. The index of the spike is 2097151 every time. Since 2^21 equals 2097152, something happens at 2^21 that slows down the entire program. The same effect can be observed at 2^22 and 2^23, and even more interesting, the lag roughly doubles each time (2^21 = ~20 ms, 2^22 = ~43 ms, 2^23 = ~81 ms). I googled this specific number and the only thing I found was some Node.js code that uses C++ under the hood.
What do I suspect?
At index 2^21 a memory area must be expanded, and that is why the delay occurs.
Questions
Is my assumption correct and the size of the vector is the problem?
How can I debug such a phenomenon? (To be certain that the vector really is the problem.)
Can I allocate enough memory beforehand to avoid the memory expansion?
What could I use instead of std::vector that supports more than 10,000,000,000 elements?
I was able to solve my problem by reserving memory with std::vector::reserve() before the time-critical part of my program. Thanks for all the comments.
Here the working code I used:
std::vector<std::string> myLogVector;
myLogVector.reserve(12000000);
//...do time critical stuff, without reallocating storage
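As a cross-check that reallocation really is the cause, a small sketch like the following (sizes and the log string are made up) records every capacity change together with how long the triggering push_back took; with the usual growth-by-doubling policy, the spikes land at power-of-two capacities and roughly double in cost each time:

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Sketch: log whenever the vector reallocates, to confirm that the observed
// spikes line up with capacity changes of the log vector.
int main() {
    std::vector<std::string> v;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < 5000000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        v.push_back("log entry");
        auto t1 = std::chrono::steady_clock::now();
        if (v.capacity() != lastCapacity) {
            std::cout << "reallocation at i=" << i
                      << " new capacity=" << v.capacity()
                      << " push_back took "
                      << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                      << " us\n";
            lastCapacity = v.capacity();
        }
    }
}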
I'm writing a C++ application in which I receive 4096 bytes of data every 0.5 seconds. The data is processed and the output is sent to some other application. Processing each set of data takes nearly 2 seconds.
This is how exactly I'm doing this.
In my main function, I'm receiving the data and pushing it into a vector.
I've created a thread which always processes the first element and deletes it immediately after processing. Below is a simulation of the receiving part of my application.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>

using namespace std;

struct Student {
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

// consumer thread: processes the front element (~2 s each) and removes it
void* processData(void* arg) {
    Student st1;
    while (true)
    {
        if (dustBin.size())
        {
            printf("front: %d\tSize: %zu\n", dustBin.front().id, dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}

int main()
{
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    // producer loop: one new element every 0.5 s
    while (true)
    {
        for (int i = 0; i < 4096; i++)
        {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code is shown in the image below.
[Output image]
In the output image you can observe the exact sequence of the processing: only one item is processed for every 4 insertions.
Note that the reception time of data <<< processing time.
Because of this, my input buffer grows very rapidly. Also, since the main thread and the processData thread share a mutex, they depend on each other to release the lock; because of this my incoming buffer sometimes gets blocked, which leads to data misses. Please suggest how I should handle this, or suggest some other method.
Thanks & Regards
Vamsi
Undefined behavior
When you read data, you must take the lock before calling size() and front(): reading the vector while the producer may be modifying it is a data race, which is undefined behavior.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever, which will use 100% of that core, slow down everything else, drain the laptop battery and make it hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or 2 on multithreading. Doing multithreading right is hard and almost impossible without taking time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you will use a condition variable or some sort of event to tell the consumer thread when data is added so it does not have to wake up uselessly to check if it is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, and on special containers or other constructs that will help implement your code.
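A minimal sketch of that idea, using std::condition_variable and a std::deque instead of the raw pthreads calls; Student follows the question, and process() is a stand-in for the real 2-second work:

#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal producer/consumer sketch with a condition variable.
struct Student { int id; int age; };

std::deque<Student> dustBin;
std::mutex dustBinMutex;
std::condition_variable dataReady;

void process(const Student&) { /* the slow 2-second work goes here */ }

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lock(dustBinMutex);
        dataReady.wait(lock, [] { return !dustBin.empty(); });  // sleeps instead of spinning
        Student st = dustBin.front();
        dustBin.pop_front();
        lock.unlock();               // do the slow processing outside the lock
        process(st);
    }
}

void producer(const Student& st) {
    {
        std::lock_guard<std::mutex> lock(dustBinMutex);
        dustBin.push_back(st);
    }
    dataReady.notify_one();          // wake the consumer only when there is work
}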
Output
Your printf and cout calls will have an impact on performance, and since some are inside a lock and others not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow, and it is perfectly possible that it is the reason why you are not able to process all the data rapidly.
Processing rate
Obviously, if you are able to produce 4096 bytes of data every 0.5 seconds but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here, as without that information we are only guessing at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems you should use a profiler to know where you lose your time. Once you know that, you will have a better idea where to look to improve your code.
Taking 2 seconds to process the data is really slow but we cannot help you since we have no idea of what your code is doing.
For example, if you add the data into a database and it is not able to keep up, you might want to batch multiple inserts into a single command to reduce the overhead of communicating with the database over the network.
Another example: if you append the data to a file, you might want to keep the file open and accumulate some data before each write, as in the sketch below.
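For the file case, a small sketch of that idea (the file name and the batch threshold are arbitrary):

#include <fstream>
#include <string>
#include <vector>

// Sketch: keep the file open and flush accumulated records in one go instead
// of writing (or opening/closing) the file for every single record.
std::ofstream out("log.txt", std::ios::app);
std::vector<std::string> pending;

void append(const std::string& line) {
    pending.push_back(line);
    if (pending.size() >= 1000) {        // flush in batches, tune the threshold
        for (const auto& l : pending)
            out << l << '\n';
        out.flush();
        pending.clear();
    }
}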
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say more than 100 small items), as all the remaining items have to be moved every time.
In addition to changing the container as suggested in a comment, another possibility would be to use 2 vectors and swap them. That way, you will be able to reduce the number of times you lock the mutex and process many items without holding the lock.
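A sketch of the swap idea (Student and process() are the same placeholders as in the earlier sketch): the consumer takes the whole batch in O(1) while holding the lock, then processes everything with no lock held.

#include <mutex>
#include <vector>

struct Student { int id; int age; };
void process(const Student&) { /* slow work */ }

std::vector<Student> shared;   // the producer push_back()s into this under sharedMutex
std::mutex sharedMutex;

void consumeBatch() {
    std::vector<Student> local;
    {
        std::lock_guard<std::mutex> lock(sharedMutex);
        local.swap(shared);    // take everything, leave the shared vector empty
    }
    for (const Student& st : local)
        process(st);           // slow work happens with no lock held
}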
How to optimize
You should accumulate enough data (say 30 seconds), stop accumulating, and then test your processing speed with that data. If you cannot process that data in less than about half the time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize the communication from the producer to the consumer(s).
You have to know whether your bottleneck is I/O, the database or something else, and whether some parts might be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...
If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the provider is faster than the consumer, older entries will be overwritten.
If you cannot skip some data and you cannot process it fast enough, you are doomed.
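A minimal, single-threaded sketch of such a fixed-size circular buffer (the usual locking or condition-variable signalling from the other answers still applies around it):

#include <array>
#include <cstddef>

// Fixed-size ring buffer sketch: when full, the oldest entry is overwritten,
// i.e. a slow consumer simply loses the oldest data.
template <typename T, std::size_t N>
class RingBuffer {
public:
    void push(const T& value) {
        buf_[head_] = value;
        head_ = (head_ + 1) % N;
        if (count_ == N)
            tail_ = (tail_ + 1) % N;   // drop the oldest element
        else
            ++count_;
    }
    bool pop(T& out) {
        if (count_ == 0) return false;
        out = buf_[tail_];
        tail_ = (tail_ + 1) % N;
        --count_;
        return true;
    }
private:
    std::array<T, N> buf_{};
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};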
Create two const variables, NBUFFERS and NTHREADS, make them both 8 initially if you have 16 cores and your processing is 4x too slow. Play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and use offsets into it to divide it up.
Start NTHREADS threads. They will each continuously wait to be told which buffer to process, then process it and wait again for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread.
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads, and data only arrives every 0.5 seconds, each thread will only get a new buffer every 4 seconds but only needs 2 seconds to clear the previous buffer.
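A rough sketch of that scheme, where receive_into() and process_buffer() are placeholders for the real I/O and processing code, might look like this:

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

constexpr int NBUFFERS = 8;
constexpr int NTHREADS = 8;
constexpr int SAMPLES  = 4096;

std::vector<char> storage(NBUFFERS * SAMPLES);     // one allocation, NBUFFERS slots

void receive_into(char*, int)  { /* read 4096 samples from the source (placeholder) */ }
void process_buffer(char*, int) { /* the ~2 seconds of real work (placeholder) */ }

struct Worker {
    std::mutex m;
    std::condition_variable cv;
    char* pending = nullptr;                       // buffer waiting to be processed

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return pending != nullptr; });
            char* buf = pending;
            pending = nullptr;
            lock.unlock();
            process_buffer(buf, SAMPLES);          // slow work, done without the lock
        }
    }
    void notify(char* buf) {
        { std::lock_guard<std::mutex> lock(m); pending = buf; }
        cv.notify_one();
    }
};

int main() {
    Worker workers[NTHREADS];
    for (auto& w : workers)
        std::thread(&Worker::run, &w).detach();

    int buffer = 0, thread = 0;
    for (;;) {
        char* slot = storage.data() + buffer * SAMPLES;
        receive_into(slot, SAMPLES);               // blocks ~0.5 s for new data
        workers[thread].notify(slot);              // hand the slot to the next worker
        buffer = (buffer + 1) % NBUFFERS;
        thread = (thread + 1) % NTHREADS;
    }
}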
gprof says that my compute-heavy app spends 53% of its time inside std::vector<...>::operator[](unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code failing to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 53%. Should I try using dynamic arrays instead (the size remains constant in most cases)? Would that be likely to help with possible bottlenecks?
Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?
Did you examine your memory access pattern itself? It might be inefficient - cache unfriendly.
Did you try using a raw pointer for the array access?

// typical code
for (int i = 0; i < arr.size(); ++i)
    wcout << arr[i];

// in the bottleneck
int* pArr = &arr.front();
for (int i = 0; i < arr.size(); ++i)
    wcout << pArr[i];
I suspect that gprof prevents functions from being inlined. Try another profiling method. std::vector operator[] cannot be the bottleneck because it doesn't differ much from raw array access. The SGI implementation is shown below:
reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }
You cannot trust gprof for high-speed code profiling; you should instead use a passive profiling method like oprofile to get the real picture.
As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of one and checking how much the execution time increases). Note that this is however going to be influenced by cache issues so YMMV.
The vector class is well liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.
If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.
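For illustration only, a fixed-size vector in a hot loop could be swapped for a plain heap array along these lines (the computation is a placeholder, not anything from the original app):

#include <cstddef>
#include <memory>

// Sketch: with a fixed size, a plain heap array sidesteps vector's operator[]
// entirely, which matters in an unoptimized or instrumented build.
double sumTable(std::size_t n) {
    std::unique_ptr<double[]> data(new double[n]);
    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<double>(i) * 0.5;   // placeholder computation
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}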
EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:
Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).
First cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced it with a similar vector class from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)
Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.
Now roughly 60% of the time was adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists; the other was to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).
Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).
Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).
I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
What I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.
I'm trying to build a hash with Berkeley DB that will contain many tuples (approx. 18 GB of key-value pairs), but in all my tests the performance of the insert operations degrades drastically over time. I've written this program to test the performance:
#include <iostream>
#include <db_cxx.h>
#include <ctime>

#define MILLION 1000000

int main() {
    long long a = 0;
    long long b = 0;
    int passes = 0;
    int i = 0;
    u_int32_t flags = DB_CREATE;

    Db* dbp = new Db(NULL, 0);
    dbp->set_cachesize(0, 1024 * 1024 * 1024, 1);
    int ret = dbp->open(
        NULL,
        "test.db",
        NULL,
        DB_HASH,
        flags,
        0);

    time_t time1 = time(NULL);
    while (passes < 100) {
        while (i < MILLION) {
            Dbt key(&a, sizeof(long long));
            Dbt data(&b, sizeof(long long));
            dbp->put(NULL, &key, &data, 0);
            a++; b++; i++;
        }

        DbEnv* dbep = dbp->get_env();
        int tmp;
        dbep->memp_trickle(50, &tmp);

        i = 0;
        passes++;
        std::cout << "Inserted one million --> pass: " << passes
                  << " took: " << time(NULL) - time1 << "sec" << std::endl;
        time1 = time(NULL);
    }

    // flush and close the database before exiting
    dbp->close(0);
    delete dbp;
    return 0;
}
Perhaps you can tell me why the "put" operation takes increasingly longer after some time, and maybe how to fix this.
Thanks for your help,
Andreas
You may want to look at the information provided by the db_stat utility and the HASH-specific tuning functions that are available. Please see BDB Reference Guide section on configuring a HASH database.
I would expect you to get 10s of thousands of inserts per second on commodity hardware. What are you experiencing and what is your performance target?
Regards,
Dave
I would recommend trying the bulk insert API, you can read about that in the documentation here:
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/CXX/dbput.html#put_DB_MULTIPLE_KEY
Also, I would guess that your call to memp_trickle is responsible for most of the slowdown. As the cache becomes dirtier, finding pages to trickle becomes more expensive. In fact, since you are only writing, having a large cache only hurts (once you've written the data, you don't use it again, so you don't want it to hang around in the cache.) I would recommend testing different (smaller) cache sizes.
Finally, if your sole concern is insert performance, using a larger page size will help. You'll be able to fit more data on each page and that will result in fewer disk writes.
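For reference, both knobs have to be set before Db::open(); something along these lines, with purely illustrative values for the experiments suggested above:

// Sketch only: page size and cache size must be set before Db::open().
Db* dbp = new Db(NULL, 0);
dbp->set_pagesize(65536);                       // larger pages -> fewer disk writes
dbp->set_cachesize(0, 64 * 1024 * 1024, 1);     // also try much smaller caches
int ret = dbp->open(NULL, "test.db", NULL, DB_HASH, DB_CREATE, 0);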
-Ben
The memp_trickle call is almost certainly slowing things down. It's often good to use trickle, but it belongs in its own thread to be effective. BDB (except when you get into the higher-level replication APIs) creates no threads for you -- nothing happens behind the scenes (thread-wise). Trickle is effective when you have dirty pages forced from the cache (look at the stat output to see if that's what's happening).
You might also consider using a BTREE instead of a HASH. Yes, I know you specifically said hash, but why? If you are looking to maximize performance, why add that restriction? You may be able to take advantage of locality of reference to reduce your cache footprint: there is often much more locality than you believe, or perhaps you can create some. If you generate keys that are random digits, for example, prepend the date and time; that usually introduces locality into a perceived 'random' system. If you do use BTREE, you'll need to pay some attention to the byte ordering of your keys on your system (look up endianness on Wikipedia); if you are using a little-endian system, you'll need to swap the bytes. Using BTREE with the right ordering and introduced locality means your key/value pairs will be stored in 'key-generation-time' order, so if you see the most action on the recent keys, you'll tend to hit the same pages over and over (see your cache hit rate in your stats). So you'll need less cache. Another way to think of it: with the same amount of cache, your solution will scale by a larger multiple.
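For example (a sketch, assuming a little-endian machine and 64-bit integer keys), storing the key bytes in big-endian order makes BTREE's byte-wise comparison match numeric order:

#include <cstdint>
#include <cstring>

// BTREE compares keys byte by byte, so on a little-endian machine an integer
// key should be stored big-endian to keep numeric order == key order.
uint64_t to_big_endian(uint64_t v) {
    unsigned char out[8];
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>(v >> (56 - 8 * i));  // most significant byte first
    uint64_t result;
    std::memcpy(&result, out, sizeof(result));
    return result;
}

// usage: uint64_t key_be = to_big_endian(a);
//        Dbt key(&key_be, sizeof(key_be));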
I expect your actual app really doesn't insert integer keys in order (if it does, you'll be lucky). So you should write a benchmark that closely simulates your access patterns, at least with respect to: size of key, size of data, access pattern, number of items in the database, mix of reads/writes. Once you have that, look at the stats -- pay close attention to anything that implies I/O or contention.
BTW, I've recently started a blog at http://libdb.wordpress.com to discuss BDB performance tuning (and other matters related to BDB). You may get some good ideas there. There can be a huge difference of latency and throughput depending on the kind of tuning you do. In particular, see http://libdb.wordpress.com/2011/01/31/revving-up-a-benchmark-from-626-to-74000-operations-per-second/
Your performance degradation can have several causes that aren't actually connected with your code. I may be mistaken, but I think this is all about the internal database structure (and the data structure used).
Think of a situation where the database uses some approach other than a hash table, for example an RB tree. Inserting into that tree would take O(log N), and every inserted element increases the time needed for the next insert.
Unfortunately, the same can happen with a plain hash table, so that the initial O(1) insertion time degrades to something worse. This can have several causes, but it's all about hash collisions, which can happen due to a bad hash function, data that happens to be bad for the hash function in use, or even the phase of the moon.
If I were you, I would try to dig into the internal structure of your DB. Also, testing your keys with something other than your DB (e.g. boost::unordered_map) could benefit your testing and profiling.
Edit: also, did you try changing that cache size setting in your sample? Or maybe there are some other performance-related parameters that could be modified?
In an ACM example, I had to build a big table for dynamic programming. I had to store two integers in each cell, so I decided to go for a std::pair<int, int>. However, allocating a huge array of them took 1.5 seconds:
std::pair<int, int> table[1001][1001];
Afterwards, I have changed this code to
struct Cell {
    int first;
    int second;
};

Cell table[1001][1001];
and the allocation took 0 seconds.
What explains this huge difference in time?
The std::pair<int, int>::pair() constructor initializes the fields with their default values (zero in the case of int), and your struct Cell doesn't (since it only has an auto-generated default constructor that does nothing).
Initializing requires writing to each field which requires a whole lot of memory accesses that are relatively time consuming. With struct Cell nothing is done instead and doing nothing is a bit faster.
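A small sketch that makes the difference visible (in an unoptimized build, like the original measurement): allocating the pair array with new[] runs the zero-initializing constructor for every element, while the POD struct array gets no initialization at all.

#include <chrono>
#include <iostream>
#include <utility>

struct Cell { int first; int second; };

int main() {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    auto* pairs = new std::pair<int, int>[1001 * 1001];   // two zero writes per element
    auto t1 = clock::now();
    auto* cells = new Cell[1001 * 1001];                   // no initialization at all
    auto t2 = clock::now();

    std::cout << "pair array:   "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() << " us\n"
              << "struct array: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << " us\n";

    delete[] pairs;
    delete[] cells;
}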
The answers so far don't explain the full magnitude of the problem.
As sharptooth has pointed out, the pair solution initializes the values to zero. As Lemurik pointed out, the pair solution isn't just initializing a contiguous block of memory, instead it is calling the pair constructor for every element in the table. However, even that doesn't account for it taking 1.5 seconds. Something else is happening.
Here's my logic:
Assuming you were on an ancient machine, say running at 1.33 GHz, then 1.5 seconds is 2e9 clock cycles. You've got about 1e6 pairs (1001 x 1001) to construct, so somehow each pair constructor is taking roughly 2000 cycles. It doesn't take 2000 cycles to call a constructor that just sets two integers to zero. I can't see how cache misses would make it take that long. I would believe it if the number was less than 100 cycles.
I thought it would be interesting to see where else all these CPU cycles are going. I used the crappiest, oldest C++ compiler I could find to see if I could attain the level of wastage required. That compiler was VC++ v6. In debug mode, it does something I don't understand. It has a big loop that calls the pair constructor for each item in the table - fair enough. That constructor sets the two values to zero - fair enough. But just before doing that, it sets all the bytes in a 68-byte region to 0xCC. That region is just before the start of the big table. It then overwrites the last element of that region with 0x28F61200. Every call of the pair constructor repeats this. Presumably this is some kind of bookkeeping by the compiler so it knows which regions are initialized when checking for pointer errors at run time. I'd love to know exactly what this is for.
Anyway, that would explain where the extra time is going. Obviously another compiler may not be this bad. And certainly an optimized release build wouldn't be.
These are all very good guesses, but as everyone knows, guesses are not reliable.
I would say just randomly pause it within that 1.5 seconds, but you'd have to be pretty quick. If you increased each dimension by a factor of about 3, you could make it take more like 10+ seconds, so it would be easier to pause.
Or, you could get it under a debugger, break it in the pair constructor code, and then single step to see what it is doing.
Either way, you would get a firm answer to the question, not just a guess.
My guess is that it's the way std::pair is created. There is more overhead in invoking the pair constructor 1001x1001 times than in just allocating a memory range.