Fastest and safest way to call functions in extern process - c++

Describtion of the problem:
we need to call a function in extern process as fast as possible. Boost interprocess shared memory is used for communication. The extern process is either mpi master or a single executable. The calculation time of the function lies between 1ms and 1s. The function should be called up to 10^8-10^9 times.
I've tried a lot of possibilities, but I still have some problems with each of them. Here I introduce two of best working implementations
Version 1 ( using intreprocess conditions )
Main-process
bool calculate(double& result, std::vector<double> c){
// data_ptr is a structure in shared memoty
data_ptr_->validCalculation = false;
bool timeout = false;
// write data (cVec_ is a vector in shared memory )
cVec_->clear();
for (int i = 0; i < c.size(); ++i)
{
cVec_->push_back(c[i]);
}
// cond_input_data is boost interprocess condition
data_ptr_->cond_input_data.notify_one();
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
// lock slave process
scoped_lock<interprocess_mutex> lock_output(data_ptr_->mutex_output);
// wait till data calculated
timeout = !(data_ptr_->cond_output_data.timed_wait(lock_output, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
Extern process runs a while-loop ( till abort condition is fullfilled)
do {
scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_function_here(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is avalible or we dont get the input data
data_ptr_->cond_output_data.notify_one();
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
This is best working version, but sometimes it holds up, if the calculation time is short (~1ms). I assume, it happens, if main-process reaches
data_ptr_->cond_input_data.notify_one();
earlier, than extern process is waiting on
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime));
waiting condition. So we have probably some kind of synchronisation problem.
Second condition does not help ( i.e. wait only if input data not set, similar to the anonymous condition example with message_in flag). Since, it is still possible, that one process notify the other one, before the second one is waiting for notification.
Version 2 ( using boolean flag and while loop with some delay )
Main-process
bool calculate(double& result, std::vector<double> c){
data_ptr_->validCalculation = false;
bool timeout = false;
// write data
cVec_->clear();
for (int i = 0; i < c.size(); ++i) //Insert data in the vector
{
cVec_->push_back(c[i]);
}
// this is the flag in shared memory used for communication
*calc_flag_ = true;
clock_t test_begin = clock();
clock_t calc_time_begin = clock();
do
{
calc_time_begin = clock();
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
// wait till data calculated
timeout = (double(calc_time_begin - test_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
} while (*(calc_flag_) && !timeout);
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
and the extern process
do {
// we wait till input data is set
wait_begin = clock();
do
{
wait_end = clock();
timeout = (double(wait_end - wait_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
} while (!(*calc_flag_) && !(*abort_flag_) && !timeout);
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_local_function(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is avalible or we dont get the input data
*calc_flag_ = false;
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
The problem in this version is the delay-time. Since we have calculation times close to 1ms, we have to set the delay at least to this value. For smaller delays the cpu-load is high, for higher delays we lose a lot of performance due to not necessary waiting time
Do you have an idea how to improve one of this versions? or may be there is a better solution?
thx.

Related

Global variable doesn't update prior to next loop

I'm trying to build a tachometer in C++ for my ESP32. When I uncomment Serial.printf("outside rev: %d \n", rev); outside of the conditional it works, but when I comment it I get values that are orders of magnitude greater than they should be (700 revolutions without, vs 7 revolutions with). My best guess is that the print statement is slowing the loop() down just enough for incrementRevolutions() to toggle the global variable passedMagnet from true to false before the next loop. That would make sense, since a delay in updating passedMagnet would allow newRevCount++; to be triggered multiple times. But this is obviously something I can't debug with either print statements or step-through debugging given the time-sensitive nature of the race condition.
bool passedMagnet = true;
int incrementRevolutions(int runningRevCount, bool passingMagnet)
{
// Serial.printf("passedMagnet: %d , passingMagnet %d , runningRevCount: %d \n", passedMagnet, passingMagnet, runningRevCount);
int newRevCount = runningRevCount;
if (passedMagnet && passingMagnet)
{ //Started a new pass of the magnet
passedMagnet = false;
newRevCount++;
}
else if (!passedMagnet && !passingMagnet)
{ //The new pass of the magnet is complete
passedMagnet = true;
}
return newRevCount;
}
unsigned long elapsedTime = 0;
unsigned long intervalTime = 0;
int rev = 0;
void loop()
{
intervalTime = millis() - elapsedTime;
rev = incrementRevolutions(rev, digitalRead(digitalPin));
// Serial.printf("outside rev: %d \n", rev);
if (intervalTime > 1000)
{
Serial.printf("rev: %d \n", rev);
rev = 0;
elapsedTime = millis();
}
}
Is this a known gotcha with Arduino or C++ programming? What should I do to fix it?
I think the test is to blame. I had to rename and move things a bit to visualize the logic, sorry about that.
bool magStateOld = false; // initialize to digitalRead(digitalPin) in setup()
int incrementRevolutions(int runningRevCount, bool magState)
{
int newRevCount = runningRevCount;
// detect positive edge.
if (magState && !magStateOld) // <- was eq. to if (magState && magStateOld)
// the large counts came from here.
{
newRevCount++;
}
magStateOld = magState; // record last state unconditionally
return newRevCount;
}
You could also write it as...
int incrementRevolutions(int n, bool magState)
{
n += (magState && !magStateOld);
magStateOld = magState;
return n;
}
But the most economical (and fastest) way of doing what you want would be:
bool magStateOld;
inline bool positiveEdge(bool state, bool& oldState)
{
bool result = (state && !oldState);
oldState = state;
return result;
}
void setup()
{
// ...
magStateOld = digitalRead(digitalPin);
}
void loop()
{
// ...
rev += (int)positiveEdge(digitalRead(digitalPin), magStateOld);
// ...
}
It's reusable, and saves both stack space and unnecessary assignments.
If you cannot get clean transitions from your sensor (noise on positive and negative edges, you'll need to debounce the signal a bit, using a timer.
Example:
constexpr byte debounce_delay = 50; // ms, you may want to play with
// this value, smaller is better.
// but must be high enough to
// avoid issues on expected
// RPM range.
// 50 ms is on the high side.
byte debounce_timestamp; // byte is large enough for delays
// up to 255ms.
// ...
void loop()
{
// ...
byte now = (byte)millis();
if (now - debounce_timestamp >= debounce_delay)
{
debounce_timestamp = now;
rev += (int)positiveEdge(digitalRead(digitalPin), magStateOld);
}
// ...
}

More efficient way for reading CAN data in while loop

I have 3 devices which send 8 bytes of data over CAN interface. To read the buffer from CAN I am using a while loop which looks something like this:
void CanServer::ReadFromCAN() {
data_from_buffer_.clear();
can_frame frame;
read_can_port_ = read(soc_, &frame, sizeof(struct can_frame));
if (read_can_port_ < 0) return;
id_ = frame.can_id&0x1FFFFFFF;
dlc_ = frame.can_dlc;
for (const auto& byte : frame.data)
data_from_buffer_.push_back(byte);
}
while (ros::ok()) {
std_msgs::Int32MultiArray tachometer_array;
std::vector<__u8> data_from_can;
/***
* Read for the Radar1
*/
this->ReadFromCAN();
if (read_can_port_ < 0) continue;
//ROS_INFO("Read from CAN");
if (id_ == can_id::RadarFrame1)
for (int i = 0; i < dlc_; i++) {
radar1_bytes_[i] = data_from_buffer_[i];
radar1_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar1_bytes_, 0)) {
frame_id = "radar1_link";
this->PulbishRadarPCL(frame_id, radar1_pub_, radar1_buffer_, 0);
radar1_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
if (id_ == can_id::RadarFrame2) {
for (int i = 0; i < dlc_; i++) {
radar2_bytes_[i] = data_from_buffer_[i];
radar2_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar2_bytes_, 1)) {
frame_id = "radar2_link";
this->PulbishRadarPCL(frame_id, radar2_pub_, radar2_buffer_, 1);
radar2_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
if (id_ == can_id::RadarFrame3) {
for (int i = 0; i < dlc_; i++) {
radar3_bytes_[i] = data_from_buffer_[i];
radar3_buffer_.push_back(data_from_buffer_[i]);
}
if (IsMagicWord(radar3_bytes_, 2)) {
frame_id = "radar3_link";
this->PulbishRadarPCL(frame_id, radar3_pub_, radar3_buffer_, 2);
radar3_buffer_.clear();
canFrame_.can_dlc = 0;
}
}
rate.sleep();
}
Where rate.sleep() is similar to sleep() function in C++.
Right now, I am running this while loop in 5 MHz however I think this is an overkill and I am getting almost 100% CPU usage on a 1 core.
I tried to play around with the delay time but I think this is highly inefficient and I wonder is there any other way to handle this?
It turns out that poll is what you need. Here is my example.
First, create a pollfd structure from <poll.h> header in Linux. I have decided to create a class member but you can create however you like:
pollfd poll_;
poll_.fd = soc_;
poll_.events = POLLIN;
poll_.revents = 0;
Here, soc_ is a socket and POLLIN means that you want to read from the socket.
Then, in my while loop, instead of delaying I just used this function at the beginning of my while loop:
poll_int = poll(&poll_, 1, 100);
if (poll_int <= 0) continue;
So poll() function returns value of 1 if the read was succesful and I made a timeout of 100ms (just a random number, I know that the data are coming at much higher rate)
With that, you will only read the data from socket whenever poll returns a value greater that 0.
Results? 3% CPU usage and if you want to add more data into your socket flow, poll will optimize for you so this is a scalable way of reading something like CAN bus.

Using a thread pool to parallelize a function makes it slower: why?

I am working a on database than runs on top on RocksDB. I have a find function that takes a query in parameter, iterates over all documents in the database, and returns the documents that match the query. I want to parallelize this function so the work is spread on multiple threads.
To achieve that, I tried to use ThreadPool: I moved the code of the loop in a lambda, and added a task to the thread pool for each document. After the loop, each result is processed by the main thread.
Current version (single thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
bson_t document;
rocksdb::Iterator* it = _db->NewIterator(rocksdb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next())
{
const char* bson_data = (const char*)it->value().data();
int bson_length = it->value().size();
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_init_static(&document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, &document))
{
++count;
if (projection != nullptr)
{
bson_error_t error;
bson_t projected;
bson_init(&projected);
mongoc_matcher_projection_execute_noop(
&document,
projection,
&projected,
&error,
NULL
);
callback(nullptr, &projected);
}
else
{
callback(nullptr, &document);
}
if (limit >= 0 && count >= limit)
{
break;
}
}
}
delete it;
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
New version (multi-thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bool limit_reached = limit == 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
auto process_document = [this, projection, num_query_keys, matcher](const char* bson_data, int bson_length) -> bson_t*
{
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_t* document = new bson_t();
bson_init_static(document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, document))
{
if (projection != nullptr)
{
bson_error_t error;
bson_t* projected = new bson_t();
bson_init(projected);
mongoc_matcher_projection_execute_noop(
document,
projection,
projected,
&error,
NULL
);
delete document;
return projected;
}
else
{
return document;
}
}
else
{
delete document;
return nullptr;
}
};
const int WORKER_COUNT = std::max(1u, std::thread::hardware_concurrency());
ThreadPool pool(WORKER_COUNT);
std::vector<std::future<bson_t*>> futures;
bson_t document;
rocksdb::Iterator* db_it = _db->NewIterator(rocksdb::ReadOptions());
for (db_it->SeekToFirst(); db_it->Valid(); db_it->Next())
{
const char* bson_data = (const char*)db_it->value().data();
int bson_length = db_it->value().size();
futures.push_back(pool.enqueue(process_document, bson_data, bson_length));
}
delete db_it;
for (auto it = futures.begin(); it != futures.end(); ++it)
{
bson_t* result = it->get();
if (result)
{
count += 1;
if (limit < 0 || count < limit)
{
callback(nullptr, result);
}
delete result;
}
}
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
With simple documents and query, the single-thread version processes 1 million documents in 0.5 second on my machine.
With the same documents and query, the multi-thread version processes 1 million documents in 3.3 seconds.
Surprisingly, the multi-thread version is way slower. Moreover, I measured the execution time and 75% of the time is spent in the for loop. So basically the line futures.push_back(pool.enqueue(process_document, bson_data, bson_length)); takes 75% of the time.
I did the following:
I checked the value of WORKER_COUNT, it is 6 on my machine.
I tried to add futures.reserve(1000000), thinking that maybe the vector re-allocation was at fault, but it didn't change anything.
I tried to remove the dynamic memory allocations (bson_t* document = new bson_t();), it didn't change the result significantly.
So my question is: is there something that I did wrong for the multi-thread version to be that slower than the single-thread version?
My current understanding is that the synchronization operations of the thread pool (when tasks are enqueued and dequeued) are simply consuming the majority of the time, and the solution would be to change the data-structure. Thoughts?
Parallelization has overhead.
It takes around 500 nanoseconds to process each document in the single-threaded version. There's a lot of bookkeeping that has to be done to delegate work to a thread-pool (both to delegate the work, and to synchronize it afterwards), and all that bookkeeping could very well require more than 500 nanoseconds per job.
Assuming your code is correct, then the bookkeeping takes around 2800 nanoseconds per job. To get a significant speedup from parallelization, you're going to want to break the work into bigger chunks.
I recommend trying to process documents in batches of 1000 at a time. Each future, instead of corresponding to just 1 document, will correspond to 1000 documents.
Other optimizations
If possible, avoid unnecessary copying. If something gets copied a bunch, see if you can capture it by reference instead of by value.

c++ My timer is unstable Win8.1 VisualStudio

I do not know how to make stable timer.
Timer did should work every 4ms.
I did use normal event timer in "WndProc()", "CreateTimerQueueTimer()",own timer using "std::chrono::high_resolution_clock" and all timers are unstable. Of course I use one. Sometimes they are calculated at the appointed time or sometimes they are too slow what make my program not smooth. Operations are not complicated is add some integers.
CreateTimerQueueTimer()
if (!CreateTimerQueueTimer(&hLogicTimer, hTimerQ, (WAITORTIMERCALLBACK)Window::LogicThread, NULL, 0, 4, NULL)) return -1;
Window::LogicThread() function.
void Window::LogicThread()
{
Window::StartRender = false;
SC->CalculateActors();
Window::StartRender = true;
}
Own function timer called by new thread.
bool Window::LogicThread()
{
typedef std::chrono::high_resolution_clock Time;
auto start = Time::now();
const auto constWait = 4000000;
while (!Window::TerminaterThread)
{
std::this_thread::sleep_for(std::chrono::nanoseconds(constWait - std::chrono::duration_cast<std::chrono::nanoseconds>(Time::now() - start).count()));
start = Time::now(); //calculate operation time and sleep thread
Window::StartRender = false;
SC->CalculateActors();
Window::StartRender = true;
}
return true;
}

"printf" appears to be non-deterministic in Qt?

I know "printf" is standard-c and should be deterministic. But when run in Qt I see a more non-deterministic response(clock cycles). Could this be due to Qt adding some "pork" to its response?
I have multiple threads that make call to function that uses a mutex. When one thread enters it set a switch so the others can't until it is done. Things appeared to work ok for acouple seconds and then threads appeared to be killed off from 10 to 1 thread. So I tried adding a delay: (k=k+1: no help), then (looping k=k+1: no help), (usleep works), and so does (printf) work at creating a random delay and allowing all threads to continue running.
void CCB::Write(int iThread)
{
static bool bUse = false;
bool bDone = false;
char cStr[20];
int posWrite;// = *m_posWrite; // issue of posWrite be altered with next extrance
long k = 0;
long m = 0;
m_threadCount++;
while(bDone == false){
if(bUse == false){
bUse = true;
posWrite = *m_posWrite;
memcpy(m_cmMessageCB + posWrite, &m_cmMessageWrite, sizeof(typeCanMessage));
memset(cStr, '\0', 20);
memcpy(cStr, (m_cmMessageCB + posWrite)->cMessage, 11); //fails: every 20
*m_posWrite = *m_posWrite + 1;
if(*m_posWrite == m_iNBufferLength)
*m_posWrite = 0;
bDone = true;
bUse = false;
}else if(bUse == true){
//why are threads being killed ?
// printf("T%d_%d ", iThread, m_threadCount);//non-deterministic value ?
usleep(1);//non-deterministic value
//k++;//delay of a couple clock cycles was not enough
/*
for(k = 0; k < iThread * 100; k++){//deterministic and fails to resolve thread problem
m++;
}
*/
}
}
}