Multithreading has no effect in performance

I am working on my packet sniffer. It has a workerConnection() function that performs the main processing on packets, as follows.
It reads a packet from _Connection_buffer, which is defined as std::string* _Connection_buffer = new std::string[_Max_Element_Connection_buffer], then splits the packet on whitespace and performs some actions on the fields.
Everything works well except the performance of running workerConnection() in parallel.
As can be seen in workerConnection(), in the first loop each thread takes a unique number as its own 'MyID', then uses MyID as the offset for reading packets from _Connection_buffer.
I am sure that when workerConnection() runs in parallel, each thread does not process all packets but only (_Max_Element_Connection_buffer/_ConnectionThreadCount) of them.
void
workerConnection() {
short MyID = 0;
_Connection_tn->_mtx.lock();
for (int n = 0; n < _ConnectionThreadCount; n++) {
if (_Connection_tn->getMyNumber[n] != 100) {
MyID = _Connection_tn->getMyNumber[n];
_Connection_tn->getMyNumber[n] = 100;
break;
}
}
_Connection_tn->_mtx.unlock();
LOG_INFO("Session Worker :%d started", MyID);
long int index = 0;
s_EntryItem* entryItem = NULL;
uint32_t srcIP_num;
uint64_t p_counter=0;
std::string connectionPayload1 = ""; connectionPayload1.reserve(40);
std::string connectionPayload2 = ""; connectionPayload2.reserve(40);
std::string connectionPayload = ""; connectionPayload.reserve(150);
std::string part; part.reserve(30);
std::string srcIP_str; srcIP_str.reserve(15);
std::string t; t.reserve(50);
std::string line; line.reserve(_snapLengthConnection + 1);
std::string Protocol; Protocol.reserve(3);
std::string srcIP_port; srcIP_port.reserve(7);
std::vector<std::string> _CStorage(_storage_Count_Connection);
struct timeval start, end;
long mtime, secs, usecs;
gettimeofday(&start, NULL);
for (index = MyID; index < _Max_Element_Connection_buffer; index+=_ConnectionThreadCount) {
if (unlikely(p_counter++ >= (_Connection_flag->_Number - MyID)) ) {
LOG_INFO("sleeped");
boost::this_thread::sleep(boost::posix_time::milliseconds(20));
continue;
}
line = _Connection_buffer[index];
if(line.size()<_storage_Count_Connection)
continue;
part = line.substr(_column_ConnectionConditionLocation, ConnectionCondition.length() + 5);
if (part.find("My Packet") == std::string::npos) { //assume always false.
LOG_INFO("Part: %s, Condition: %s", part.c_str(), ConnectionCondition.c_str());
continue;
}
boost::split(_CStorage,line,boost::is_any_of(" "));
t = _CStorage[_column_ConnectionUserIP];
auto endSource = t.find("/", 0);
srcIP_str = t.substr(0, endSource);
try {
srcIP_num = boost::asio::ip::address_v4::from_string(srcIP_str).to_ulong();
} catch (...) {
continue;
}
entryItem = searchIPTable(srcIP_num);
if (entryItem == NULL) {
continue;
}
int ok = processConnection(srcIP_num, connectionPayload1, connectionPayload2, &_CStorage);
if (!ok) {
continue;
}
auto startSource = t.find("/", 8) + 1;
endSource = t.find("-", startSource);
connectionPayload2+=t.substr(startSource, endSource - startSource);
connectionPayload2+=_CStorage[_column_ConnectionProtocol];
entryItem->_mtx.lock();
if (entryItem->InUse != false) {
connectionPayload = entryItem->payload1;
connectionPayload += connectionPayload1;
connectionPayload += entryItem->payload2;
connectionPayload += connectionPayload2;
}
entryItem->_mtx.unlock();
}
gettimeofday(&end, NULL);
secs = end.tv_sec - start.tv_sec;
usecs = end.tv_usec - start.tv_usec;
mtime = ((secs) * 1000 + usecs/1000.0) + 0.5;
LOG_INFO("Worker %d :Elapsed time for %lu times running Connection Worker is %ld millisecs\n",MyID,index, mtime);
LOG_INFO("Connection Worker %d :stopped",MyID);
}
So now there is a problem that drives me crazy: processing all the packets takes the same time with one, 2, or n threads.
As said earlier, with multithreading I am sure each thread processes only _Max_Element_Connection_buffer/_ConnectionThreadCount packets.
I ran my program on a virtual Ubuntu 14.04 machine that has 20 GB of RAM and 4 CPUs. The host OS is Windows 8 and has 8 CPUs.
The specification of one of my virtual Ubuntu processors is shown below:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
stepping : 3
cpu MHz : 3192.718
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
The results of running my program with 1, 2, 4 and 8 threads are as follows:
Number of _Connection_buffer elements= 10,000
using one thread:
Worker 0 :Elapsed time for 10000 times running Connection Worker is 149 millisecs
using 2 threads:
Worker 0 :Elapsed time for 10000 times running Connection Worker is 122 millisecs
Worker 1 :Elapsed time for 10001 times running Connection Worker is 127 millisecs
using 4 threads:
Worker 0 :Elapsed time for 10000 times running Connection Worker is 127 millisecs
Worker 1 :Elapsed time for 10002 times running Connection Worker is 129 millisecs
Worker 2 :Elapsed time for 10001 times running Connection Worker is 138 millisecs
Worker 3 :Elapsed time for 10003 times running Connection Worker is 140 millisecs
using 8 threads:
Worker 0 :Elapsed time for 10007 times running Connection Worker is 135 millisecs
Worker 1 :Elapsed time for 10000 times running Connection Worker is 154 millisecs
Worker 2 :Elapsed time for 10002 times running Connection Worker is 153 millisecs
Worker 3 :Elapsed time for 10003 times running Connection Worker is 158 millisecs
Worker 4 :Elapsed time for 10006 times running Connection Worker is 169 millisecs
Worker 5 :Elapsed time for 10004 times running Connection Worker is 170 millisecs
Worker 6 :Elapsed time for 10005 times running Connection Worker is 176 millisecs
Worker 7 :Elapsed time for 10001 times running Connection Worker is 178 millisecs
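
The striding scheme described in the question (each worker starts at its own ID and steps by the thread count) can be checked in isolation. The following is only a minimal sketch using std::thread; the function name strided_counts and its parameters are hypothetical and are not part of the asker's program. It counts how many indices each worker visits, without doing any packet processing.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker visits indices id, id + T, id + 2T, ... and counts them.
// Returns the per-worker counts; together they cover every index once.
std::vector<std::size_t> strided_counts(std::size_t element_count,
                                        unsigned thread_count) {
    assert(thread_count > 0);
    std::vector<std::size_t> counts(thread_count, 0);
    std::vector<std::thread> workers;
    for (unsigned id = 0; id < thread_count; ++id) {
        workers.emplace_back([&counts, element_count, thread_count, id] {
            std::size_t local = 0;  // thread-local tally, written back once
            for (std::size_t index = id; index < element_count;
                 index += thread_count)
                ++local;
            counts[id] = local;  // each worker writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

With 10,000 elements and 4 workers, every worker visits 2,500 indices, which agrees with the asker's expectation that the work is divided. Note also that the loop variable ends at or just past the buffer size whatever the thread count, so a log line that prints the final index (as the "Elapsed time for 10000 times" lines appear to do) reports roughly 10,000 for every worker, not the number of packets that worker handled.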

Related

Boost::serial_port and asio::async_write for byte[256] exceeds 200ms

I am implementing some basic communication over a serial port.
According to the protocol I must respond within 225 ms after receiving the request. The maximum package size is 256 B.
So, when I receive a request I create a response [header, length, payload, crc16] (256 B in total), which takes 30 - 40 ms on average. The actual problem occurs when I pass that response (a byte array) to asio::async_write. That function takes around 240 ms on average to process the byte array.
Everything works fine except when I send maximum-length packages. That takes 240 ms (asio::async_write) + 40 ms (package assembly), around 280 ~ 300 ms.
Port: 9600 baud, length 8, one stop bit
Any idea how I can speed it up?
void Example::do_write()
{
if (pimpl_->WriteBuffer == nullptr)
{
boost::lock_guard<boost::mutex> l(pimpl_->WriteQueueMutex);
pimpl_->WriteBufferSize = pimpl_->WriteQueue.size();
pimpl_->WriteBuffer.reset(new byte[pimpl_->WriteQueue.size()]);
std::move(pimpl_->WriteQueue.begin(), pimpl_->WriteQueue.end(), pimpl_->WriteBuffer.get());
pimpl_->WriteQueue.clear();
begin = boost::chrono::steady_clock::now();
async_write(pimpl_->Port, asio::buffer(pimpl_->WriteBuffer.get(), pimpl_->WriteBufferSize), boost::bind(&Example::write_end, this, asio::placeholders::error));
}
}
void Example::write_end(const system::error_code& error)
{
if (!error)
{
boost::lock_guard<boost::mutex> l(pimpl_->WriteQueueMutex);
if (pimpl_->WriteQueue.empty())
{
pimpl_->WriteBuffer.reset();
pimpl_->WriteBufferSize = 0;
end = boost::chrono::steady_clock::now();
OutputDebugString(string("\nWRITE TIME: " + to_string(boost::chrono::duration_cast<boost::chrono::milliseconds>(end - begin).count()) + "\n").c_str());
return;
}
pimpl_->WriteBufferSize = pimpl_->WriteQueue.size();
pimpl_->WriteBuffer.reset(new byte[pimpl_->WriteQueue.size()]);
std::move(pimpl_->WriteQueue.begin(), pimpl_->WriteQueue.end(), pimpl_->WriteBuffer.get());
pimpl_->WriteQueue.clear();
async_write(pimpl_->Port, asio::buffer(pimpl_->WriteBuffer.get(), pimpl_->WriteBufferSize), boost::bind(&Example::write_end, this, asio::placeholders::error));
}
else
{
set_error_status(true);
do_close();
}
}

In my experience boost::asio itself takes only fractions of a microsecond. You use 40 ms to fetch the data, the communication takes 220 ms (about the minimum needed to send 256 bytes at 9600 baud), and somewhere you waste another 20-40 ms, so the sum is 280 - 300 ms.
What to do?
The most profitable change is likely to increase the baud rate to 115200 (about 0.85 ms/byte at 9600 baud versus 0.07 ms/byte at 115200 baud).
The next idea is to profile where those 20-40 ms go (likely something unneeded in the message loop you wrote).
Finally, it is likely also possible to reduce the 40 ms spent fetching the data.
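
The wire time behind these figures is simple arithmetic and can be written down as a small helper. This is a sketch, not part of the original answer: the function name transmit_ms is hypothetical, and the 10-bit frame (1 start + 8 data + 1 stop, no parity) is an assumption based on the "length 8, one stop bit" port settings, since the question does not state the parity.

```cpp
#include <cassert>

// Time in milliseconds to clock `byte_count` bytes out of a UART.
// `bits_per_frame` is the number of bit times per byte on the wire:
// 8 counts the data bits only; 10 is the assumed 8N1 framing
// (1 start + 8 data + 1 stop, no parity).
double transmit_ms(int byte_count, int baud, int bits_per_frame) {
    assert(byte_count >= 0 && baud > 0 && bits_per_frame > 0);
    return 1000.0 * byte_count * bits_per_frame / baud;
}
```

Counting only data bits, transmit_ms(256, 9600, 8) is about 213 ms, close to the 220 ms quoted. With the assumed 10-bit frame it is about 267 ms, which would already exceed a 225 ms deadline before any software overhead, while transmit_ms(256, 115200, 10) is about 22 ms.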

pausing function and continue from a certain point in a robot movement

I have a question on how to program a certain sequence for my robot.
Let's say I program it to run from position a to b. It has a sensor attached, and if the sensor detects x, the robot should perform an action called y at the position where it detected x. Action y doesn't change its position.
I would like the robot to continue toward b from where it left off after performing action y. However, I do not know how to pause the sequence from a to b and resume it after action y. I control only the wheel motors and their timing, so I can only set the speed of the wheels for a certain time.
Is there a general pause function (not sleep) in C++ that lets the code continue running from where it paused?
For now I know how to reset the action, but that's not what I want.
Example: make the robot move from a to b in 10 seconds, detect object x at 3 seconds, do action y at the position reached at t = 3 seconds, then continue the motion for the remaining 7 seconds after action y is done.

You can try an event (message) driven architecture, like the following pseudocode:
vector<action> action_sequence={do_a,do_b,do_c};
int i=0;
while(1)
{
e = WaitForMessage();
switch(e.code)
{
case do_action:
action_sequence[i].run();//perform an action
...//some code to have a scheduler thread to send next
//do_action message in certain seconds to this thread.
i++;
break; //do not fall through into the default case
default:
...
}
}

The answer depends on your code: are you using Windows messaging, are you using threads, etc. Assuming you are using neither, just linear code, you could implement your own sleep function that takes a caller-supplied function used to assess whether the sleep should be preempted. If it is preempted, the function returns the time left so the action can be continued later.
This allows linear code to handle your situation. I knocked up a sample and will explain the bits.
typedef bool (*preempt)(void);
DWORD timedPreemptableAction (DWORD msTime, preempt fn)
{
DWORD startTick = GetTickCount();
DWORD endTick = startTick + msTime;
DWORD currentTick;
do
{
currentTick = GetTickCount();
}
while (fn () == false && currentTick < endTick);
return currentTick > endTick ? 0 : endTick-currentTick;
}
The key function above obtains the start time in milliseconds and will not exit until the timeout expires or the user-provided function returns true.
This user-provided function could poll input devices, for example for a key press. For now, to match your question, I have added a user function which returns true after 3 seconds:
DWORD startToTurnTicks = 0;
bool startToTurn (void)
{
bool preempt = false;
// TODO Implement method of preemption. For now
// just use a static timer, yep a little hacky
//
// 0 = uninitialized
// 1 = complete, no more interruptions
// >1 = starting tick count
if (startToTurnTicks == 0)
{
startToTurnTicks = GetTickCount();
}
else
{
if (startToTurnTicks != 1)
{
if ((startToTurnTicks + 3000) < GetTickCount())
{
startToTurnTicks = 1;
preempt = true;
}
}
}
return preempt;
}
Now we have a function which waits for a given time and can exit early, and a user function which returns true after 3 seconds. Now the main call:
bool neverPreempt (void)
{
return false;
}
int main (void)
{
int appResult = 0;
DWORD moveForwardTime = 1000*10;
DWORD turnTime = 1000*3;
DWORD startTicks = GetTickCount();
printf ("forward : %d seconds in\n",
(GetTickCount()-startTicks)/1000);
moveForwardTime = timedPreemptableAction (moveForwardTime, &startToTurn);
printf ("turn : %d seconds in\n",
(GetTickCount()-startTicks)/1000);
turnTime = timedPreemptableAction (turnTime, &neverPreempt);
printf ("turn complete : %d seconds in\n",
(GetTickCount()-startTicks)/1000);
if (moveForwardTime > 0)
{
printf ("forward again : %d seconds in\n",
(GetTickCount()-startTicks)/1000);
moveForwardTime = timedPreemptableAction (moveForwardTime, &neverPreempt);
printf ("forward complete : %d seconds in\n",
(GetTickCount()-startTicks)/1000);
}
return appResult;
}
In the main code you see timedPreemptableAction is called 3 times. The first time we pass the user function which returns true after 3 seconds. This first call exits after three seconds, returning 7 seconds left. The app prints:
f:\projects\cmake_test\build\Debug>main
forward : 0 seconds in
turn : 3 seconds in
turn complete : 6 seconds in
forward again : 6 seconds in
forward complete : 13 seconds in
Started to move forward @0 seconds, "paused" @3 seconds, resumed @6 and finished @13.
0->3 + 6->13 = 10 seconds.
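
The sample above relies on the Windows-only GetTickCount. The same preemptable-wait idea can be sketched portably with std::chrono; this is an illustration under stated assumptions, not the answerer's code. The names timed_preemptable_action, Clock and Ms are hypothetical, and the 1 ms sleep between polls is a choice made here to avoid a busy loop.

```cpp
#include <chrono>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Waits up to `duration`, polling `should_preempt` about once per
// millisecond. Returns the time still remaining if the wait was
// preempted, or zero if the full duration elapsed.
Ms timed_preemptable_action(Ms duration,
                            const std::function<bool()>& should_preempt) {
    const auto deadline = Clock::now() + duration;
    while (Clock::now() < deadline) {
        if (should_preempt()) {
            const auto left =
                std::chrono::duration_cast<Ms>(deadline - Clock::now());
            return left > Ms::zero() ? left : Ms::zero();
        }
        std::this_thread::sleep_for(Ms(1));  // avoid spinning at 100% CPU
    }
    return Ms::zero();
}
```

A caller would start the forward motion, call timed_preemptable_action(Ms(10000), sensorSeesX) with its own sensor check, perform action y if a non-zero remainder comes back, and then call the function again with that remainder and a callback that always returns false.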

Why Win32 api _beginthreadex/CreateThread leaks when using CRT or not?

I have the following function that is executed on a Thread created with _beginthreadex or CreateThread:
static volatile LONG g_executedThreads = 0;
void executeThread(int v){
//1. leaks: time_t tt = _time64(NULL);
//2. leaks: FILETIME ft; GetSystemTimeAsFileTime(&ft);
//3. no leak: SYSTEMTIME stm; GetSystemTime(&stm);
InterlockedAdd(&g_executedThreads, 1); // count nr of executions
}
When I uncomment either line 1. (CRT call) or line 2. (Win32 API call), the thread leaks and subsequent calls to _beginthreadex fail (GetLastError returns error 8: Not enough storage is available to process this command).
Memory reported by Process Explorer, when _beginthreadex starts to fail:
Private 130 Mb, Virtual 150 Mb.
But if I uncomment only line 3. (another Win32 API call), no leak happens and nothing fails after 1 million threads. Here the memory reported is Private 1.4 Mb, Virtual 25 Mb. This version also ran very fast (20 secs for 1 million threads, versus 60 secs for 30000 in the first case).
I've tested (see the test code here) with Visual Studio 2013, compiled for x86 (debug and release), and ran it on Win 8.1 x64. After creating about 30000 threads, _beginthreadex starts failing (for most of the calls). I want to mention that fewer than 100 threads are running simultaneously.
Updated 2:
My assumption of at most 100 threads was based on the console output (scheduled is approximately equal to completed), and the Threads tab in Process Explorer did not report more than 10 threads.
Here is the console output (no WaitForSingleObject, original code):
step:0, scheduled:1, completed:1
step:5000, scheduled:5001, completed:5000
...
step:25000, scheduled:25001, completed:24999
step:30000, scheduled:30001, completed:30001
_beginthreadex failed. err(8); errno(12). exiting ...
step:31701, scheduled:31712, completed:31710
rerun loop:
step:0, scheduled:31713, completed:31711
_beginthreadex failed. err(8); errno(12). exiting ...
step:6, scheduled:31719, completed:31716
Based on @SHR's and @HarryJohnston's suggestions I now schedule 64 threads at once and wait for all of them to complete (see updated code here), but the behaviour is the same. Note that I've also tried a single thread at a time, but then the failure happens sporadically. Also, the reserved stack size is 64K!
Here is the new schedule function:
static unsigned int __stdcall _beginthreadex_wrapper(void *arg) {
executeThread(1);
return 0;
}
const int maxThreadsCount = MAXIMUM_WAIT_OBJECTS;
bool _beginthreadex_factory(int& step) {
DWORD lastError = 0;
HANDLE threads[maxThreadsCount];
int threadsCount = 0;
while (threadsCount < maxThreadsCount){
unsigned int id;
threads[threadsCount] = (HANDLE)_beginthreadex(NULL,
64 * 1024, _beginthreadex_wrapper, NULL, STACK_SIZE_PARAM_IS_A_RESERVATION, &id);
if (threads[threadsCount] == NULL) {
lastError = GetLastError();
break;
}
else threadsCount++;
}
if (threadsCount > 0) {
WaitForMultipleObjects(threadsCount, threads, TRUE, INFINITE);
for (int i = 0; i < threadsCount; i++) CloseHandle(threads[i]);
}
step += threadsCount;
g_scheduledThreads += threadsCount;
if (threadsCount < maxThreadsCount) {
printf(" %03d sec: step:%d, _beginthreadex failed. err(%d); errno(%d). exiting ...\n", getLogTime(), step, lastError, errno);
return false;
}
else return true;
}
Here is what is printed on Console:
000 sec: step:6400, scheduled:6400, completed:6400
003 sec: step:12800, scheduled:12800, completed:12800
007 sec: step:19200, scheduled:19200, completed:19200
014 sec: step:25600, scheduled:25600, completed:25600
022 sec: step:32000, scheduled:32000, completed:32000
023 sec: step:32358, _beginthreadex failed. err(8); errno(12). exiting ...
sleep 5 seconds
028 sec: step:32358, scheduled:32358, completed:32358
try to create 2 more times
028 sec: step:32361, _beginthreadex failed. err(8); errno(12). exiting ...
032 sec: step:32361, scheduled:32361, completed:32361
rerun loop: 1
036 sec: step:3, _beginthreadex failed. err(8); errno(12). exiting ...
sleep 5 seconds
041 sec: step:3, scheduled:32364, completed:32364
try to create 2 more times
041 sec: step:5, _beginthreadex failed. err(8); errno(12). exiting ...
045 sec: step:5, scheduled:32366, completed:32366
rerun loop: 2
056 sec: step:2, _beginthreadex failed. err(8); errno(12). exiting ...
sleep 5 seconds
061 sec: step:2, scheduled:32368, completed:32368
try to create 2 more times
061 sec: step:4, _beginthreadex failed. err(8); errno(12). exiting ...
065 sec: step:4, scheduled:32370, completed:32370
Any suggestion/info is welcome.
Thanks.

I think you have it wrong.
Take a look at this code:
#include <windows.h>
// Thread entry point with the signature CreateThread expects.
DWORD WINAPI thread_func(LPVOID p)
{
Sleep(1000);
return 0;
}
int main()
{
for(int i=0;i<1000000;i++)
{
DWORD id;
HANDLE h = CreateThread(NULL,0, thread_func,NULL,0,&id);
WaitForSingleObject(h,INFINITE);
CloseHandle(h); // release the thread handle once the thread has finished
}
return 0;
}
A leaking thread would leak simply because it is called, so the wait doesn't change a thing; yet when you watch this in Performance Monitor, you'll see all lines stay almost constant.
Now ask yourself: what happens when I remove the WaitForSingleObject?
Threads are created much faster than they finish, so you reach the per-process thread limit or the per-process memory limit.
Note that if you are compiling for x86, memory is limited to 4GB, but only 2GB is used for user-mode memory and the other 2GB for kernel-mode memory. If you use the default stack size (1MB) per thread, and the rest of the program uses no memory at all (which never happens, since you have code...), then you are limited to about 2000 threads. Once the 2GB is used up, you can't create more threads until previous threads finish.
So my conclusion is that you create threads without waiting, and after a while no memory is left for more threads.
You can check whether this is the case with Performance Monitor by looking at the maximum thread count of your process.
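
The "create, then wait before creating more" pattern this answer argues for can be sketched portably. The following is only an analogue built on std::thread, not Win32 code: the function name run_in_batches is hypothetical, and join() stands in for the WaitForMultipleObjects plus CloseHandle pair used in the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Runs `total` short-lived tasks, but never keeps more than `batch_size`
// threads alive at once: every batch is joined before the next is created.
// Returns how many tasks actually ran.
int run_in_batches(int total, int batch_size) {
    assert(total >= 0 && batch_size > 0);
    std::atomic<int> executed{0};
    for (int started = 0; started < total;) {
        const int n = std::min(batch_size, total - started);
        std::vector<std::thread> batch;
        batch.reserve(n);
        for (int i = 0; i < n; ++i)
            batch.emplace_back([&executed] { executed.fetch_add(1); });
        for (auto& t : batch) t.join();  // thread resources are released here
        started += n;
    }
    return executed.load();
}
```

Because each batch is fully joined before the next starts, the number of live threads (and the stack memory they reserve) is bounded by batch_size rather than growing with the total.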

After uninstalling the antivirus, the failure could not be reproduced (the code even runs as fast as in scenario 3.).

boost thread while all thread not completed print something

I need to know the following:
boost::thread_group tgroup;
loop 10 times
tgroup.create_thread( boost::bind( &c , 2, 2 ) )
<==
tgroup.join_all()
What can I do at the <== location above to continuously print the number of threads that have completed their jobs?

You can use an atomic counter:
#include <boost/atomic.hpp>
#include <boost/bind.hpp>
#include <boost/chrono.hpp>
#include <boost/thread/thread.hpp>
#include <iostream>
static boost::atomic_int running_count(20);
static void worker(boost::chrono::milliseconds effort)
{
boost::this_thread::sleep_for(effort);
--running_count;
}
int main()
{
boost::thread_group tg;
for (int i = 0, count = running_count; i < count; ++i) // count protects against data race!
tg.create_thread(boost::bind(worker, boost::chrono::milliseconds(i*50)));
while (running_count > 0)
{
std::cout << "Monitoring threads: " << running_count << " running\n";
boost::this_thread::sleep_for(boost::chrono::milliseconds(100));
}
tg.join_all();
}
Example output:
Monitoring threads: 19 running
Monitoring threads: 17 running
Monitoring threads: 15 running
Monitoring threads: 13 running
Monitoring threads: 11 running
Monitoring threads: 9 running
Monitoring threads: 7 running
Monitoring threads: 5 running
Monitoring threads: 3 running
Monitoring threads: 1 running
Another way would be to use a semaphore.
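
For readers without Boost, the same counter idea can be sketched with the C++11 standard library alone. This is an illustration, not the answerer's code: the function name monitor_completed is hypothetical, it counts completed threads upward instead of running threads downward, and it collects the observed values in a vector so the behaviour can be checked (a real monitor would print them instead).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Starts `thread_count` workers, polls how many have finished, and returns
// every value the monitor observed (the last entry is taken after joining).
std::vector<int> monitor_completed(int thread_count) {
    assert(thread_count >= 0);
    std::atomic<int> completed{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < thread_count; ++i) {
        workers.emplace_back([&completed, i] {
            // Simulated work of increasing length per worker.
            std::this_thread::sleep_for(std::chrono::milliseconds(5 * i));
            ++completed;  // last thing the worker does
        });
    }
    std::vector<int> observed;
    while (completed.load() < thread_count) {
        observed.push_back(completed.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    for (auto& t : workers) t.join();
    observed.push_back(completed.load());
    return observed;
}
```

The polling loop sits exactly where the question's <== marker is: after the threads are created and before they are joined.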

The easiest way would be to print the thread id at the end of job "c":
void c()
{
//some code here
some_safe_print << boost::this_thread::get_id();
}
That way, when a thread finishes, its last instruction prints its id.

pthread scheduling problems

I have two threads in a producer-consumer pattern. The code works, but then the consumer thread gets starved, and then the producer thread gets starved.
When working, program outputs:
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Then something changes and threads get starved, program outputs:
Send Data...semValue = 1
Send Data...semValue = 2
Send Data...semValue = 3
...
Send Data...semValue = 256
Send Data...semValue = 257
Send Data...semValue = 258
Recv Data...semValue = 257
Recv Data...semValue = 256
Recv Data...semValue = 255
...
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
I know threads are scheduled by the OS and can run at different rates and in random order. My question: when I call YieldThread (which calls pthread_yield), shouldn't the Talker give the Listener a chance to run? Why am I getting this bizarre scheduling?
Snippet of code below. The Thread and Semaphore classes are abstraction classes. I went ahead and stripped out the queue used for passing data between the threads so I could eliminate that variable.
const int LOOP_FOREVER = 1;
class Listener : public Thread
{
public:
Listener(Semaphore* dataReadySemaphorePtr)
: Thread("Listener"),
dataReadySemaphorePtr(dataReadySemaphorePtr)
{
//Intentionally left blank.
}
private:
void ThreadTask(void)
{
while(LOOP_FOREVER)
{
this->dataReadySemaphorePtr->Wait();
printf("Recv Data...");
YieldThread();
}
}
Semaphore* dataReadySemaphorePtr;
};
class Talker : public Thread
{
public:
Talker(Semaphore* dataReadySemaphorePtr)
: Thread("Talker"),
dataReadySemaphorePtr(dataReadySemaphorePtr)
{
//Intentionally left blank
}
private:
void ThreadTask(void)
{
while(LOOP_FOREVER)
{
printf("Send Data...");
this->dataReadySemaphorePtr->Post();
YieldThread();
}
}
Semaphore* dataReadySemaphorePtr;
};
int main()
{
Semaphore dataReadySemaphore(0);
Listener listener(&dataReadySemaphore);
Talker talker(&dataReadySemaphore);
listener.StartThread();
talker.StartThread();
while (LOOP_FOREVER); //Wait here so threads can run
}

No. Unless you are using a lock to prevent it, even if one thread yields its quantum, there's no requirement that the other thread receives the next quantum.
In a multithreaded environment, you can never ever ever make assumptions about how processor time is going to be scheduled; if you need to enforce correct behavior, use a lock.

Believe it or not, it runs that way because it's more efficient. Every time the processor switches between threads, it performs a context switch that wastes a certain amount of time. My advice is to let it go unless you have another requirement like a maximum latency or queue size, in which case you need another semaphore for "ready for more data" in addition to your "data ready for listening" one.
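
The two-semaphore arrangement suggested here ("data ready" plus "ready for more data") can be sketched as follows. This is a minimal illustration under stated assumptions: the Semaphore class below is a stand-in written for this example with a mutex and condition variable, since the question's own Semaphore and Thread classes are not shown, and the names run_handoff and readyForMore are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Minimal counting semaphore (stand-in for the question's Semaphore class).
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) { assert(initial >= 0); }
    void Post() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Runs `rounds` send/receive pairs and returns the observed order,
// 'S' for each send and 'R' for each receive.
std::string run_handoff(int rounds) {
    Semaphore dataReady(0);     // talker -> listener: an item is available
    Semaphore readyForMore(1);  // listener -> talker: the slot is free again
    std::string log;            // only touched by whoever holds the "turn"
    std::thread talker([&] {
        for (int i = 0; i < rounds; ++i) {
            readyForMore.Wait();  // block until the previous item was taken
            log += 'S';
            dataReady.Post();
        }
    });
    std::thread listener([&] {
        for (int i = 0; i < rounds; ++i) {
            dataReady.Wait();     // block until an item is available
            log += 'R';
            readyForMore.Post();
        }
    });
    talker.join();
    listener.join();
    return log;
}
```

With readyForMore starting at 1, the talker can never get more than one item ahead, so the order is strictly alternating regardless of how the OS schedules the two threads. A larger initial value would allow a bounded queue of that size instead of strict alternation.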
