Using a thread pool to parallelize a function makes it slower: why? - c++

I am working on a database that runs on top of RocksDB. I have a find function that takes a query as a parameter, iterates over all documents in the database, and returns the documents that match the query. I want to parallelize this function so the work is spread across multiple threads.
To achieve that, I tried to use ThreadPool: I moved the code of the loop in a lambda, and added a task to the thread pool for each document. After the loop, each result is processed by the main thread.
Current version (single thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
bson_t document;
rocksdb::Iterator* it = _db->NewIterator(rocksdb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next())
{
const char* bson_data = (const char*)it->value().data();
int bson_length = it->value().size();
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_init_static(&document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, &document))
{
++count;
if (projection != nullptr)
{
bson_error_t error;
bson_t projected;
bson_init(&projected);
mongoc_matcher_projection_execute_noop(
&document,
projection,
&projected,
&error,
NULL
);
callback(nullptr, &projected);
}
else
{
callback(nullptr, &document);
}
if (limit >= 0 && count >= limit)
{
break;
}
}
}
delete it;
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
New version (multi-thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bool limit_reached = limit == 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
auto process_document = [this, projection, num_query_keys, matcher](const char* bson_data, int bson_length) -> bson_t*
{
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_t* document = new bson_t();
bson_init_static(document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, document))
{
if (projection != nullptr)
{
bson_error_t error;
bson_t* projected = new bson_t();
bson_init(projected);
mongoc_matcher_projection_execute_noop(
document,
projection,
projected,
&error,
NULL
);
delete document;
return projected;
}
else
{
return document;
}
}
else
{
delete document;
return nullptr;
}
};
const int WORKER_COUNT = std::max(1u, std::thread::hardware_concurrency());
ThreadPool pool(WORKER_COUNT);
std::vector<std::future<bson_t*>> futures;
bson_t document;
rocksdb::Iterator* db_it = _db->NewIterator(rocksdb::ReadOptions());
for (db_it->SeekToFirst(); db_it->Valid(); db_it->Next())
{
const char* bson_data = (const char*)db_it->value().data();
int bson_length = db_it->value().size();
futures.push_back(pool.enqueue(process_document, bson_data, bson_length));
}
delete db_it;
for (auto it = futures.begin(); it != futures.end(); ++it)
{
bson_t* result = it->get();
if (result)
{
count += 1;
if (limit < 0 || count < limit)
{
callback(nullptr, result);
}
delete result;
}
}
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
With simple documents and query, the single-thread version processes 1 million documents in 0.5 second on my machine.
With the same documents and query, the multi-thread version processes 1 million documents in 3.3 seconds.
Surprisingly, the multi-thread version is way slower. Moreover, I measured the execution time and 75% of the time is spent in the for loop. So basically the line futures.push_back(pool.enqueue(process_document, bson_data, bson_length)); takes 75% of the time.
I did the following:
I checked the value of WORKER_COUNT, it is 6 on my machine.
I tried to add futures.reserve(1000000), thinking that maybe the vector re-allocation was at fault, but it didn't change anything.
I tried to remove the dynamic memory allocations (bson_t* document = new bson_t();), it didn't change the result significantly.
So my question is: did I do something wrong that makes the multi-thread version that much slower than the single-thread version?
My current understanding is that the synchronization operations of the thread pool (when tasks are enqueued and dequeued) are simply consuming the majority of the time, and the solution would be to change the data-structure. Thoughts?

Parallelization has overhead.
It takes around 500 nanoseconds to process each document in the single-threaded version. There's a lot of bookkeeping that has to be done to delegate work to a thread-pool (both to delegate the work, and to synchronize it afterwards), and all that bookkeeping could very well require more than 500 nanoseconds per job.
Assuming your code is correct, then the bookkeeping takes around 2800 nanoseconds per job. To get a significant speedup from parallelization, you're going to want to break the work into bigger chunks.
I recommend trying to process documents in batches of 1000 at a time. Each future, instead of corresponding to just 1 document, will correspond to 1000 documents.
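To make the batching idea concrete, here is a minimal sketch. It does not use the ThreadPool class or RocksDB from the question: std::async stands in for the pool, and matches() and find_batched() are hypothetical names standing in for the per-document work and the find function. The only point it shows is that one future covers a whole batch, so the scheduling cost is paid once per batch rather than once per document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for the per-document work (decrypt + match).
bool matches(int document) { return document % 3 == 0; }

// One future per batch of `batch_size` documents instead of one per document.
std::vector<int> find_batched(const std::vector<int>& documents, std::size_t batch_size) {
    if (batch_size == 0) batch_size = 1;  // avoid an endless loop on a zero batch size
    std::vector<std::future<std::vector<int>>> futures;
    for (std::size_t begin = 0; begin < documents.size(); begin += batch_size) {
        const std::size_t end = std::min(begin + batch_size, documents.size());
        // std::async is an assumption here; a thread pool's enqueue() would take its place.
        futures.push_back(std::async(std::launch::async, [&documents, begin, end] {
            std::vector<int> hits;
            for (std::size_t i = begin; i < end; ++i) {
                if (matches(documents[i])) hits.push_back(documents[i]);
            }
            return hits;
        }));
    }
    // Collect the per-batch results in submission order.
    std::vector<int> results;
    for (auto& future : futures) {
        std::vector<int> hits = future.get();
        results.insert(results.end(), hits.begin(), hits.end());
    }
    return results;
}
```

Calling find_batched(documents, 1000) on a million documents creates 1000 futures instead of a million. In the RocksDB case each batch would also need to own copies of the value bytes, because the slice returned by an iterator's value() is only valid until the iterator moves.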
Other optimizations
If possible, avoid unnecessary copying. If something gets copied a bunch, see if you can capture it by reference instead of by value.

Related

How to use redis pipeline to get all updates to db at once?

I use the redis++ library and have a Redis db which contains keys with a TTL set. I want to be informed of all updates to my db. I subscribed to __keyevent@0__:set and set a subscriber callback like this
subscriber.on_message([&keys = updated_redis_keys](std::string, std::string key) {
keys.push_back(std::move(key));
});
and use a while loop to consume events:
while (true)
{
try
{
subscriber.consume();
for (const auto &key : keys)
pipeline.get(key).ttl(key);
auto replies = pipeline.exec();
for (std::size_t i = 0; i < keys.size(); ++i)
{
static constexpr std::size_t ValueIndex = 0;
static constexpr std::size_t TtlIndex = 1;
const auto value = replies.get<std::optional<std::string>>(i * 2 + ValueIndex);
const auto ttl = replies.get<long long>(i * 2 + TtlIndex);
const auto entry = Entry(std::move(updated_redis_keys[i]), *value, static_cast<std::size_t>(ttl));
do_something_with(entry);
}
keys.clear();
}
catch (const sw::redis::Error &err)
{
}
}
I tried to accumulate keys in the keys vector and use pipeline.exec() to get the values and TTLs of all updated keys at once. But I think subscriber.consume(); consumes only a single event each time, so keys.size() always equals 1.
How can I get better performance by stacking more keys before running exec()?
You can collect a batch of keys by running multiple consume()s before running the pipeline. Even better, you can have a time window threshold, and run the pipeline when reaching the threshold (even if we have not collected enough keys).
const int batch_size = 10;
const std::chrono::seconds time_threshold(30);
while (true) {
auto cnt = 0;
auto begin = std::chrono::steady_clock::now();
std::vector<std::string> keys;
while (cnt < batch_size && std::chrono::steady_clock::now() - begin < time_threshold) {
// Not enough keys yet and still within the time window: consume another event.
try {
subscriber.consume();
} catch (const Error &e) {
// handle errors.
}
++cnt;
}
// now we've gotten a batch of keys or reached the time threshold. do the pipeline job.
// your original code here.
}
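For illustration, the collection step can be isolated into a small helper. This is only a sketch: collect_batch and consume_one are hypothetical names, and consume_one stands in for a call to subscriber.consume() whose message callback appends to keys. It counts the keys actually collected rather than the number of consume calls, and it assumes consume_one returns from time to time (for example because a socket timeout is configured), since the deadline is only checked between calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Calls consume_one() until `keys` holds batch_size entries or the time window expires.
// consume_one is expected to append to `keys` (e.g. through the subscriber callback).
template <typename ConsumeOne>
void collect_batch(std::vector<std::string>& keys, ConsumeOne consume_one,
                   std::size_t batch_size, std::chrono::milliseconds window) {
    const auto deadline = std::chrono::steady_clock::now() + window;
    while (keys.size() < batch_size && std::chrono::steady_clock::now() < deadline) {
        consume_one();
    }
}
```

After collect_batch returns, the pipeline from the question can be run once over whatever keys were gathered, and keys.clear() starts the next batch.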

C++ Slow fetching of Recordset rows

I've got an issue with fetching the rows of a Recordset: it is really slow.
We've got a virtual ListCtrl where the data is retrieved and set in the "OnGetdispinfo" method.
This is pretty fast (~2 seconds for 300k rows on localhost); however, if the connection is slow, the GUI becomes unresponsive and completely unusable until the job is finished.
So I've tried to do the SQL work in a different thread and update the list once all data is fetched.
The issue with the unresponsive GUI is solved with that, but the time it takes to get all the data jumped from 2 seconds to several minutes.
Even if I don't do anything but loop through the rows (just calling MoveNext() in the loop until EOF is reached), it still takes over a minute to complete.
How do I resolve the issue with the freezing GUI without completely destroying the performance here?
I've included the relevant code below
m_pRecordset is a normal Recordset
Old:
void KundenListControlSQLCommand::OnGetdispinfo(NMHDR* pNMHDR, LRESULT* pResult)
{
if (m_pRecordset->IsBOF())
{
*pResult = 0;
return;
}
LV_DISPINFO* pDispInfo = (LV_DISPINFO*)pNMHDR;
LV_ITEM* pItem = &(pDispInfo)->item;
if (pItem->mask & LVIF_TEXT)
{
CString strData;
m_pRecordset->SetAbsolutePosition(pItem->iItem + 1);
if (getStatusRow() != pItem->iSubItem)
{
m_pRecordset->GetFieldValue(short(pItem->iSubItem), strData);
}
::lstrcpy(pItem->pszText, strData);
}
if (pItem->mask & LVIF_IMAGE)
{
int const nIndex = this->GetParent()->SendMessage(OT_VLC_ONGETIMAGEINDEX, pItem->iItem, 0);
if (0 != nIndex)
{
pItem->iImage = nIndex - 1;
}
}
*pResult = 0;
}
void KundenListControlSQLCommand::loadAndDisplayData()
{
ASSERT(!m_strSQLCommand.IsEmpty());
CWaitCursor wc;
try
{
if (!m_pDatabase->IsOpen())
{
CString strSQL = m_pDatabase->getDatabaseInfo().getConnectString();
m_pDatabase->OpenEx(strSQL);
}
// Determine the record count
m_nRecordCount = m_pRecordset->selectCount(_T("*"), m_strSQLCommand);
if (m_pRecordset->IsOpen())
m_pRecordset->Close();
m_pRecordset->Open(Recordset::snapshot, m_strSQLCommand + m_strSortOrder,
Recordset::executeDirect | Recordset::noDirtyFieldCheck |
Recordset::readOnly | Recordset::useBookmarks);
SetItemCountEx(m_nRecordCount);
}
catch (CDBException* e)
{
e->ReportError();
e->Delete();
}
}
New:
void KundenListControlSQLCommand::loadAndDisplayData()
{
ASSERT(!m_strSQLCommand.IsEmpty());
CWaitCursor wc;
try
{
if (!m_pDatabase->IsOpen())
{
CString strSQL = m_pDatabase->getDatabaseInfo().getConnectString();
m_pDatabase->OpenEx(strSQL);
}
// Determine the record count
m_nRecordCount = m_pRecordset->selectCount(_T("*"), m_strSQLCommand);
if (m_pRecordset->IsOpen())
m_pRecordset->Close();
m_pRecordset->Open(Recordset::dynaset, m_strSQLCommand + m_strSortOrder,
Recordset::executeDirect | Recordset::noDirtyFieldCheck |
Recordset::readOnly | Recordset::useBookmarks);
m_vResult.clear();
m_vResult.reserve(m_nRecordCount);
int nFieldCount = m_pRecordset->GetODBCFieldCount();
CString strData;
while (!m_pRecordset->IsEOF())
{
for (auto i = 0; i < nFieldCount; i++)
{
m_pRecordset->GetFieldValue(short(i), strData);
m_vResult.push_back(std::move(strData));
}
if (m_bAbort)
{
m_bAbort = false;
return;
}
m_pRecordset->MoveNext();
}
GetParent()->SendMessage(OT_VLC_ON_LIST_DONE, NULL, NULL);
}
catch (CDBException* e)
{
e->ReportError();
e->Delete();
}
}
void KundenListControlSQLCommand::OnGetdispinfo(NMHDR* pNMHDR, LRESULT* pResult)
{
if (m_pRecordset->IsBOF())
{
*pResult = 0;
return;
}
LV_DISPINFO* pDispInfo = (LV_DISPINFO*)pNMHDR;
LV_ITEM* pItem = &(pDispInfo)->item;
UINT nItem = (pItem->iItem * m_pRecordset->GetODBCFieldCount()) + pItem->iSubItem;
if (pItem->mask & LVIF_TEXT && m_vResult.size() >= nItem)
{
::lstrcpy(pItem->pszText, std::move(m_vResult.at(nItem)));
}
if (pItem->mask & LVIF_IMAGE)
{
int const nIndex = this->GetParent()->SendMessage(OT_VLC_ONGETIMAGEINDEX, pItem->iItem, 0);
if (0 != nIndex)
{
pItem->iImage = nIndex - 1;
}
}
*pResult = 0;
}
As I can see in your code, you read the data and place them into the vector. In such a setting, I think you don't really need a dynaset recordset, which according to the documentation is "A recordset with bi-directional scrolling". It fetches data row-by-row, which may be what makes the process slow. Also, "changes made by other users to the data values are visible following a fetch operation", but I think this is not of critical importance in this case. It would be mostly useful for displaying more "live" data, that are updated often.
Instead, a snapshot, or even forwardOnly recordset would suffice and would be faster. You can also experiment with the CRecordset::useMultiRowFetch option. The documentation says it's faster. It requires some changes to your code (moving next etc). Take a look here: Recordset: Fetching Records in Bulk (ODBC).
An alternative, radically different implementation would be to use bookmarks instead. Loading would be a lot faster, but scrolling somewhat sluggish, as you will have to fetch data in the OnGetdispinfo() function.
Finally, a tip: if you are using MS SQL Server and haven't already, check the native driver; many people online claim that it's considerably faster.
I don't know much about ODBC, but I suspect that there are better ways to get bulk data.
Regardless, you do a lot of unnecessary copying of your vectors. Two easy fixes:
Right after m_vResult.clear();, reserve space in m_vResult for the number of records (with push_back you want reserve, not resize).
Instead of m_vResult.push_back(vResult); do m_vResult.push_back(std::move(vResult));, as you don't need your vResult after that.
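As a rough sketch of those two fixes in plain standard C++ (std::string stands in for CString, and flatten_rows is a hypothetical helper, not part of the code in the question): the capacity is reserved once up front, and each cell is moved into the result instead of copied. Note that when every field is stored as its own element, the capacity needed is rows times fields, not just the row count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Flattens rows of fields into a single vector, one element per cell.
std::vector<std::string> flatten_rows(std::vector<std::vector<std::string>> rows,
                                      std::size_t field_count) {
    std::vector<std::string> cells;
    cells.reserve(rows.size() * field_count);  // one allocation instead of repeated regrowth
    for (auto& row : rows) {
        for (auto& cell : row) {
            cells.push_back(std::move(cell));  // steal the string buffer instead of copying it
        }
    }
    return cells;
}
```

With this layout, the cell for list item iItem and column iSubItem sits at index iItem * field_count + iSubItem, which matches the indexing used in the OnGetdispinfo handler from the question.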
Another solution is to build a cache list by handling the LVN_ODCACHEHINT notification (this example is for CListView, but you can adapt it to your CListCtrl):
// header.h
class CYourListView : public CListView
{
// ...
afx_msg void OnLvnOdcachehint(NMHDR* pNMHDR, LRESULT *pResult);
};
and implementation:
// YourListView.cpp
// ...
ON_NOTIFY_REFLECT(LVN_ODCACHEHINT, &CYourListView::OnLvnOdcachehint)
END_MESSAGE_MAP()
void CYourListView::OnLvnOdcachehint(NMHDR* pNMHDR, LRESULT* pResult)
{
LPNMLVCACHEHINT pCacheHint = reinterpret_cast<LPNMLVCACHEHINT>(pNMHDR);
const DWORD dwTo = pCacheHint->iTo;
const DWORD dwFetched = m_vResult.size();
if (dwTo >= dwFetched) // new rows must be fetched
{
const DWORD dwColCount = m_pRecordset->GetColumnCount();
m_vResult.resize(dwTo + 1);
for (DWORD dwRow = dwFetched; dwRow <= dwTo; ++dwRow)
{
CDBRecord* pRecord = new CDBRecord;
pRecord->SetSize(dwColCount);
for (DWORD dwCol = 1; dwCol <= dwColCount; dwCol++)
{
CDBValue* pDBValue = new CDBValue(m_pRecordset, dwCol);
pRecord->SetAt(dwCol - 1, pDBValue);
}
m_vResult.emplace(m_vResult.begin() + dwRow, pRecord);
m_pRecordset->MoveNext();
}
}
*pResult = 0;
}
You might need to adjust some variables and values to fit your situation.

More efficient way for reading CAN data in while loop

I have 3 devices which send 8 bytes of data over CAN interface. To read the buffer from CAN I am using a while loop which looks something like this:
void CanServer::ReadFromCAN() {
data_from_buffer_.clear();
can_frame frame;
read_can_port_ = read(soc_, &frame, sizeof(struct can_frame));
if (read_can_port_ < 0) return;
id_ = frame.can_id&0x1FFFFFFF;
dlc_ = frame.can_dlc;
for (const auto& byte : frame.data)
data_from_buffer_.push_back(byte);
}
while (ros::ok()) {
std_msgs::Int32MultiArray tachometer_array;
std::vector<__u8> data_from_can;
/***
* Read for the Radar1
*/
this->ReadFromCAN();
if (read_can_port_ < 0) continue;
//ROS_INFO("Read from CAN");
if (id_ == can_id::RadarFrame1) {
for (int i = 0; i < dlc_; i++) {
radar1_bytes_[i] = data_from_buffer_[i];
radar1_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar1_bytes_, 0)) {
frame_id = "radar1_link";
this->PulbishRadarPCL(frame_id, radar1_pub_, radar1_buffer_, 0);
radar1_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
if (id_ == can_id::RadarFrame2) {
for (int i = 0; i < dlc_; i++) {
radar2_bytes_[i] = data_from_buffer_[i];
radar2_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar2_bytes_, 1)) {
frame_id = "radar2_link";
this->PulbishRadarPCL(frame_id, radar2_pub_, radar2_buffer_, 1);
radar2_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
if (id_ == can_id::RadarFrame3) {
for (int i = 0; i < dlc_; i++) {
radar3_bytes_[i] = data_from_buffer_[i];
radar3_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar3_bytes_, 2)) {
frame_id = "radar3_link";
this->PulbishRadarPCL(frame_id, radar3_pub_, radar3_buffer_, 2);
radar3_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
rate.sleep();
}
Here rate.sleep() is similar to the sleep() function in C++.
Right now I am running this while loop at 5 MHz; however, I think this is overkill, and I am getting almost 100% CPU usage on one core.
I tried to play around with the delay time, but I think this is highly inefficient. Is there any other way to handle this?
It turns out that poll is what you need. Here is my example.
First, create a pollfd structure from <poll.h> header in Linux. I have decided to create a class member but you can create however you like:
pollfd poll_;
poll_.fd = soc_;
poll_.events = POLLIN;
poll_.revents = 0;
Here, soc_ is a socket and POLLIN means that you want to read from the socket.
Then, in my while loop, instead of delaying I just used this function at the beginning of my while loop:
poll_int = poll(&poll_, 1, 100);
if (poll_int <= 0) continue;
poll() returns a positive value (the number of descriptors with pending events, here 1) when data is ready to be read, 0 if the timeout expires, and -1 on error. I used a timeout of 100 ms (an arbitrary number; I know that the data arrives at a much higher rate).
With that, you only read from the socket when poll returns a value greater than 0.
Results? 3% CPU usage. Because the thread sleeps in the kernel until data arrives, this keeps working if more traffic is added to the socket, so it is a scalable way of reading something like a CAN bus.
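The same idea can be wrapped in a small helper. This is a sketch under the assumption of a POSIX system; wait_readable is a hypothetical name, and it works for any readable file descriptor, so a pipe can be used to try it out without CAN hardware.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Blocks for at most timeout_ms until fd has data to read.
// Returns true only if a read on fd would not block; false on timeout or error.
bool wait_readable(int fd, int timeout_ms) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;  // interested in "data available to read"
    const int ready = poll(&pfd, 1, timeout_ms);
    return ready > 0 && (pfd.revents & POLLIN) != 0;
}
```

In the loop from the question this would replace rate.sleep(): call wait_readable(soc_, 100) at the top of each iteration and continue when it returns false, so ReadFromCAN() only runs when a frame is actually waiting.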

c++ find and handle each lines is slower than php

Hello, I created a program that handles a config file by checking each line and extracting the config blocks. I first wrote it in PHP, and the speed was amazing. We have blocks like this
Block {
}
The PHP program can read each line and detect about 50,000 of these blocks in just 1 second. After that I rewrote the program in C++, but I ran into a very bad problem: my program was too slow (it reads 50,000 of these blocks in 55 seconds), even though my PHP codes do exactly the same thing as my C++ codes. PHP was 55x faster than C++ while the codes are the same.
This is my code in PHP
const PATH = "conf.txt";
if(!file_exists(PATH)) die("path_not_found");
if(!is_readable((PATH))) die("path_not_readable");
$Lines = explode("\r\n", file_get_contents(PATH));
class Block
{
public $Name;
public $Keys = array();
public $Blocks = array();
}
function Handle(& $Lines, $Start, & $Return_block, & $End_on)
{
for ($i = $Start; $i < count($Lines); $i++)
{
while (trim($Lines[$i]) != "")
{
$Pos1 = strpos($Lines[$i], "{");
$Pos2 = strpos($Lines[$i], "}");
if($Pos1 !== false && ($Pos2 === false || $Pos2 > $Pos1)) // Detect { in less position
{
$thisBlock = new Block();
$thisBlock->Name = trim(substr($Lines[$i], 0, $Pos1));
$Lines[$i] = substr($Lines[$i], $Pos1 + 1);
Handle($Lines, $i, $thisBlock, $i);
$Return_block->Blocks[] = $thisBlock;
}
else { // Detect } in less position than {
$Lines[$i] = substr($Lines[$i], $Pos2 + 1);
$End_on = $i;
return;
}
}
}
}
$DefaultBlock = new Block();
Handle($Lines, 0, $DefaultBlock, $NullValue);
$OutsideKeys = $DefaultBlock->Keys;
$Blocks = $DefaultBlock->Blocks;
echo "Found (".count($OutsideKeys).") keys and (".count($Blocks).") blocks.<br><br>";
and this is my code in C++
string Trim(string & s)
{
auto wsfront = std::find_if_not(s.begin(), s.end(), [](int c) {return std::isspace(c); });
auto wsback = std::find_if_not(s.rbegin(), s.rend(), [](int c) {return std::isspace(c); }).base();
return (wsback <= wsfront ? std::string() : std::string(wsfront, wsback));
}
class Block
{
private:
string Name;
vector <Block> Blocks;
public:
void Add(Block & thisBlock) { Blocks.push_back(thisBlock); }
Block(string Getname = string()) { Name = Getname; }
int Count() { return Blocks.size(); }
};
void Handle(vector <string> & Lines, size_t Start, Block & Return, size_t & LastPoint, bool CheckEnd = true)
{
for (size_t i = Start; i < Lines.size(); i++)
{
while (Trim(Lines[i]) != "")
{
size_t Pos1 = Lines[i].find("{");
size_t Pos2 = Lines[i].find("}");
if (Pos1 != string::npos && (Pos2 == string::npos || Pos1 < Pos2)) // Found {
{
string Name = Trim(Lines[i].substr(0, Pos1));
Block newBlock = Block(Name);
Lines[i] = Lines[i].substr(Pos1 + 1);
Handle(Lines, i, newBlock, i);
Return.Add(newBlock);
}
else { // Found }
Lines[i] = Lines[i].substr(Pos2 + 1);
return;
}
}
}
}
int main()
{
string Cont;
___PATH::GetFileContent("D:\\conf.txt", Cont);
vector <string> Lines = ___String::StringSplit(Cont, "\r\n");
Block Return;
size_t Temp;
// The problem (low handle speed) start from here not from including or split
Handle(Lines, 0, Return, Temp);
cout << "Is(" << Return.Count() << ")" << endl;
return 0;
}
As you can see, these codes do exactly the same thing, but I don't know why the PHP version is 55x faster than my C++ codes. You can create a txt file containing about 50,000 of these blocks
Block {
}
and test it yourself. Please help me fix this. I am really confused (same code but not the same performance):
PHP = 50,000 blocks detected in 1 second
C++ = 50,000 blocks detected in 55 seconds (and maybe more)!
I have no problem with my program design, because I get the performance I want in PHP; my problem is that C++ is 55x slower than PHP for the same logic.
I am using Visual Studio 2017 to compile this C++ program.
First, "code" is singular, not plural.
C++ is a very different language than php. It is not "the same code", and it is nowhere near the same in action.
For example, these two lines:
Block newBlock = Block(Name);
Return.Add(newBlock);
First create a Block on the stack, and then call Block's copy constructor to make another one inside the vector. You then throw away the stack object.
Also, vectors guarantee that their storage is contiguous, so as you add new Blocks via your Add method, the vector will occasionally stop, allocate a larger chunk of memory (the growth factor is implementation-defined, typically 1.5x or 2x), copy everything over to that new chunk, and then free the old one. Either preallocate the vector (via vector::reserve()), or consider using something like a deque, which doesn't guarantee contiguity in memory, if you don't need that property.
I also don't know what ___String::StringSplit does, but you are almost certain to have the same vector growth problem in reading your file.
Culprit is in these 2 lines:
Handle(Lines, i, newBlock, i);
Return.Add(newBlock);
Let's say you have 5 nested levels of 1 block each. What happens on the bottom one? You copy one instance of Block. What happens on level 4? You copy 2 blocks (the parent and its child). So for 5 levels you make 15 copies: 1+2+3+4+5. Look at this diagram:
Handle level1 copies 5 blocks (`Return`->level2->level3->level4->level5)
Handle level2 copies 4 blocks (`Return`->level3->level4->level5)
Handle level3 copies 3 blocks (`Return`->level4->level5)
Handle level4 copies 2 blocks (`Return`->level5)
Handle level5 copies 1 block (`Return`)
Formula is:
S = ( N + N^2 ) / 2
so for 20 levels you would make 210 copies, and so on.
Suggestion is to use move semantics to avoid this copy:
// change method Add to this
void Add(Block thisBlock) { Blocks.push_back(std::move(thisBlock)); }
// and change this call
Return.Add( std::move( newBlock ) );
Or allocate blocks dynamically using smart pointers
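To check the arithmetic, here is a small self-contained sketch that counts copy-constructor calls for a chain of nested blocks. The counter g_copies and the names AddByCopy, AddByMove, BuildChain and count_copies are made up for this illustration; the Block class is a simplified version of the one in the question.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

int g_copies = 0;  // number of Block copy-constructor calls (illustration only)

class Block {
public:
    explicit Block(std::string name = std::string()) : name_(std::move(name)) {}
    // Copying a Block also copies every nested Block through the vector copy.
    Block(const Block& other) : name_(other.name_), blocks_(other.blocks_) { ++g_copies; }
    Block(Block&&) noexcept = default;
    void AddByCopy(const Block& child) { blocks_.push_back(child); }
    void AddByMove(Block child) { blocks_.push_back(std::move(child)); }
private:
    std::string name_;
    std::vector<Block> blocks_;
};

// Builds `levels` nested blocks under parent, mirroring the recursion in Handle().
void BuildChain(Block& parent, int levels, bool use_move) {
    if (levels == 0) return;
    Block child("level");
    BuildChain(child, levels - 1, use_move);
    if (use_move) parent.AddByMove(std::move(child));
    else parent.AddByCopy(child);
}

// Returns how many Block copies were made while building the chain.
int count_copies(int levels, bool use_move) {
    g_copies = 0;
    Block root;
    BuildChain(root, levels, use_move);
    return g_copies;
}
```

With copying, count_copies(5, false) gives 15 and count_copies(20, false) gives 210, matching the formula above; with moves, count_copies(5, true) gives 0 because only the vector's internal pointers change hands.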
Out of simple curiosity, try this Trim implementation instead:
void _Trim(std::string& result, const std::string& s) {
const auto* ptr = s.data();
const auto* left = ptr;
const auto* end = s.data() + s.size();
while (ptr < end && std::isspace(static_cast<unsigned char>(*ptr))) {
++ptr;
}
if (ptr == end) {
result = "";
return;
}
left = ptr;
while (end > left && std::isspace(static_cast<unsigned char>(*(end-1)))) {
--end;
}
result = std::string(left, end);
}
std::string Trim(const std::string& s) {
// Not sure if RVO would fire for direct implementation of _Trim here
std::string result;
_Trim(result, s);
return result;
}
And another optimization:
void Add(Block& thisBlock) {
Blocks.push_back(std::move(thisBlock));
}
// Don't use thisBlock after call to this function. It is
// far from being pretty but it should avoid *lots* of copies.
I wonder if you'll get better result. Pls let me know.

Fastest and safest way to call functions in extern process

Description of the problem:
We need to call a function in an external process as fast as possible. Boost interprocess shared memory is used for communication. The external process is either an MPI master or a single executable. The calculation time of the function lies between 1 ms and 1 s. The function should be called up to 10^8-10^9 times.
I've tried a lot of possibilities, but I still have some problems with each of them. Here I introduce the two best-working implementations.
Version 1 (using interprocess conditions)
Main-process
bool calculate(double& result, std::vector<double> c){
// data_ptr_ is a structure in shared memory
data_ptr_->validCalculation = false;
bool timeout = false;
// write data (cVec_ is a vector in shared memory )
cVec_->clear();
for (int i = 0; i < c.size(); ++i)
{
cVec_->push_back(c[i]);
}
// cond_input_data is boost interprocess condition
data_ptr_->cond_input_data.notify_one();
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
// lock slave process
scoped_lock<interprocess_mutex> lock_output(data_ptr_->mutex_output);
// wait till data calculated
timeout = !(data_ptr_->cond_output_data.timed_wait(lock_output, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
The external process runs a while loop (until the abort condition is fulfilled)
do {
scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_function_here(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
// Notify the other process that the data is available or that we did not get the input data
data_ptr_->cond_output_data.notify_one();
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
This is the best-working version, but sometimes it hangs if the calculation time is short (~1 ms). I assume this happens if the main process reaches
data_ptr_->cond_input_data.notify_one();
earlier than the external process is waiting on the
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime));
waiting condition. So we probably have some kind of synchronisation problem.
A second condition does not help (i.e. wait only if the input data is not set, similar to the anonymous condition example with the message_in flag), since it is still possible that one process notifies the other before the other one is waiting for the notification.
Version 2 (using a boolean flag and a while loop with some delay)
Main-process
bool calculate(double& result, std::vector<double> c){
data_ptr_->validCalculation = false;
bool timeout = false;
// write data
cVec_->clear();
for (int i = 0; i < c.size(); ++i) //Insert data in the vector
{
cVec_->push_back(c[i]);
}
// this is the flag in shared memory used for communication
*calc_flag_ = true;
clock_t test_begin = clock();
clock_t calc_time_begin = clock();
do
{
calc_time_begin = clock();
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
// wait till data calculated
timeout = (double(calc_time_begin - test_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
} while (*(calc_flag_) && !timeout);
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
and the external process
do {
// we wait till input data is set
wait_begin = clock();
do
{
wait_end = clock();
timeout = (double(wait_end - wait_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
} while (!(*calc_flag_) && !(*abort_flag_) && !timeout);
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_local_function(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
// Notify the other process that the data is available or that we did not get the input data
*calc_flag_ = false;
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
The problem in this version is the delay time. Since we have calculation times close to 1 ms, we have to set the delay to at least this value. For smaller delays the CPU load is high; for larger delays we lose a lot of performance due to unnecessary waiting.
Do you have an idea how to improve one of these versions? Or maybe there is a better solution?
Thanks.