Infiniband RDMA completion queues with multithreading - c++

I'm learning how to use RDMA over InfiniBand. One problem I'm having is using a connection from more than one thread: I can't figure out how to create another completion queue, so the work completions get mixed up between the threads and it craps out. How do I create a queue for each thread using the connection?
Take this vomit for example:
void worker(struct ibv_cq* cq){
while(conn->peer_mr.empty()) Sleep(1);
struct ibv_wc wc{};
struct ibv_send_wr wr{};
memset(&wr, 0, sizeof wr);
struct ibv_sge sge{};
sge.addr = reinterpret_cast<unsigned long long>(conn->rdma_memory_region);
sge.length = RDMA_BUFFER_SIZE;
sge.lkey = conn->rdma_mr->lkey;
wr.wr_id = reinterpret_cast<unsigned long long>(conn);
wr.opcode = IBV_WR_RDMA_READ;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
struct ibv_send_wr* bad_wr = nullptr;
while(true){
if(queue >= maxqueue) continue;
for(auto i = 0ULL; i < conn->peer_mr.size(); ++i){
wr.wr.rdma.remote_addr = reinterpret_cast<unsigned long long>(conn->peer_mr[i]->mr.addr) + conn->peer_mr[i]->offset;
wr.wr.rdma.rkey = conn->peer_mr[i]->mr.rkey;
const auto err = ibv_post_send(conn->qp, &wr, &bad_wr);
if(err){
std::cout << "ibv_post_send " << err << "\n" << "Errno: " << std::strerror(errno) << "\n";
exit(err);
}
++queue;
conn->peer_mr[i]->offset += RDMA_BUFFER_SIZE;
if(conn->peer_mr[i]->offset >= conn->peer_mr[i]->mr.length) conn->peer_mr[i]->offset = 0;
}
int ne;
do{
ne = ibv_poll_cq(cq, 1, &wc);
} while(!ne);
--queue;
++number;
}
}
If I had more than one of these workers, they would all receive each other's work completions. I want each thread to receive only its own completions, not those of other threads.

The completion queues are created somewhere outside of this code (you are passing in an ibv_cq *). If you'd like to figure out how to create multiple ones, that's the area to focus on.
However, the "crapping out" is not (just) happening because completions are mixed up between threads: the ibv_poll_cq and ibv_post_send functions are thread safe. Instead, the likely problem is that your code isn't thread-safe: there are shared data structures that are accessed without locks (conn->peer_mr). You would have the same issues even without RDMA.
The first step is to figure out how to split up the work into pieces. Think about the pieces that each thread will need to make it independent from the others. It'll likely be a single peer_mr, a separate ibv_cq *, and a specific chunk of your rdma_mr. Then code that :)
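As a rough illustration of the "split up the work" step, here is a minimal sketch that carves the local buffer into one chunk per worker so that no two threads touch the same bytes. WorkerSlice and make_slices are hypothetical names, not part of the original code or of libibverbs. The per-thread completion queue is not shown: it would be created in the setup code with ibv_create_cq, and since a queue pair is bound to its send and receive CQs when it is created, a CQ per thread in practice also means a queue pair per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread context: everything one worker needs, nothing shared.
struct WorkerSlice {
    std::size_t peer_index;    // which peer_mr entry this worker owns
    std::size_t local_offset;  // start of this worker's chunk of the local buffer
    std::size_t local_length;  // size of that chunk in bytes
};

// Split a local buffer of `buffer_size` bytes evenly among `workers` threads.
// Any remainder bytes are left unused at the end of the buffer.
std::vector<WorkerSlice> make_slices(std::size_t buffer_size, std::size_t workers) {
    assert(workers > 0);
    const std::size_t chunk = buffer_size / workers;
    std::vector<WorkerSlice> slices;
    slices.reserve(workers);
    for (std::size_t i = 0; i < workers; ++i) {
        slices.push_back(WorkerSlice{i, i * chunk, chunk});
    }
    return slices;
}
```

Each worker would then build its sge from rdma_memory_region + local_offset and only ever advance the offset of its own peer_mr entry, which removes the unsynchronized sharing of conn->peer_mr.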

Related

Reason for losing messages over NNG sockets in raw mode

Some context to my problem:
I need to establish an inter-process communication using C++ and sockets and I picked NNG library for that along with nngpp c++ wrapper. I need to use push/pull protocol so no contexts handling is available to me. I wrote some code based on raw example from nngpp demo. The difference here is that, by using push/pull protocol I split this into two separate programs. One for sending and one for receiving.
Problem description:
I need to receive let's say a thousand or more messages per second. For now, all messages are captured only when I send about 50/s. That is way too slow and I do believe it can be done faster. The faster I send, the more I lose. At the moment, when sending 1000msg/s I lose about 150 msgs.
Some words about the code
The code may be in C++17 standard. It is written in object-oriented manner so in the end I want to have a class with "receive" method that would simply give me the received messages. For now, I just print the results on screen. Below, I supply some parts of the project with descriptions:
NOTE msgItem is a struct like that:
struct msgItem {
nng::aio aio;
nng::msg msg;
nng::socket_view itemSock;
explicit msgItem(nng::socket_view sock) : itemSock(sock) {}
};
And it is taken from example mentioned above.
Callback function that is executed when message is received by one of the aio's (callback is passed in constructor of aio object). It aims at checking whether everything was ok with transmission, retrieving my Payload (just string for now) and passing it to queue while a flag is set. Then I want to print those messages from the queue using separate thread.
void ReceiverBase<Payload>::aioCallback(void *arg) try {
msgItem *msgItem = (struct msgItem *)arg;
Payload retMsg{};
auto result = msgItem->aio.result();
if (result != nng::error::success) {
throw nng::exception(result);
}
//Here we extract the message
auto msg = msgItem->aio.release_msg();
auto const *data = static_cast<typename Payload::value_type *>(msg.body().data());
auto const count = msg.body().size()/sizeof(typename Payload::value_type);
std::copy(data, data + count, std::back_inserter(retMsg));
{
std::lock_guard<std::mutex> lk(m_msgMx);
newMessageFlag = true;
m_messageQueue.push(std::move(retMsg));
}
msgItem->itemSock.recv(msgItem->aio);
} catch (const nng::exception &e) {
fprintf(stderr, "server_cb: %s: %s\n", e.who(), e.what());
} catch (...) {
fprintf(stderr, "server_cb: unknown exception\n");
}
A separate thread watches for the flag change and prints. The while loop at the end keeps the program running. I use msgCounter to count successfully received messages.
void ReceiverBase<Payload>::start() {
auto listenerLambda = [this](){ // capture this: the lambda uses member variables
std::string temp;
while (true) {
std::lock_guard<std::mutex> lg(m_msgMx);
if(newMessageFlag) {
temp = std::move(m_messageQueue.front());
m_messageQueue.pop();
++msgCounter;
std::cout << msgCounter << "\n";
newMessageFlag = false;
}}};
std::thread listenerThread (listenerLambda);
while (true) {
std::this_thread::sleep_for(std::chrono::microseconds(1));
}
}
This is my sender application. I tweak the message frequency by changing the value in std::chrono::milliseconds(val).
int main (int argc, char *argv[])
{
std::string connection_address{"ipc:///tmp/async_demo1"};
std::string longMsg{" here normally I have some long test text"};
std::cout << "Trying connecting sender:";
StringSender sender(connection_address);
sender.setupConnection();
for (int i=0; i<1000; ++i) {
std::this_thread::sleep_for(std::chrono::milliseconds(3));
sender.send(longMsg);
}
}
And this is receiver:
int main (int argc, char *argv[])
{
std::string connection_address{"ipc:///tmp/async_demo1"};
std::cout << "Trying connecting receiver:";
StringReceiver receiver(connection_address);
receiver.setupConnection();
std::cout<< "Connection set up. \n";
receiver.start();
return 0;
}
Nothing special in those two applications, as you can see. The setup method from StringReceiver looks something like this:
bool ReceiverBase<Payload>::setupConnection() {
m_connected = false;
try {
for (size_t i = 0; i < m_parallel; ++i) {
m_msgItems.at(i) = std::make_unique<msgItem>(m_sock);
m_msgItems.at(i)->aio =
nng::aio(ReceiverBase::aioCallback, m_msgItems.at(i).get());
}
m_sock.listen(m_adress.c_str());
m_connected = true;
for (size_t i = 0; i < m_parallel; ++i) {
m_msgItems.at(i)->itemSock.recv(m_msgItems.at(i)->aio);
}
} catch (const nng::exception &e) {
printf("%s: %s\n", e.who(), e.what());
}
return m_connected;
}
Do You have any suggestions why the performance is so low? Do I use lock_guards properly here? What I want them to do is basically lock the flag and queue so only one side has access to it.
NOTE: Adding more listeners thread does not affect the performance either way.
NOTE2: newMessageFlag is atomic
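One possible reading of the symptom, not confirmed anywhere in the post: the callback can push several messages between two iterations of the listener, but the listener pops only one message and then clears newMessageFlag, so the remaining messages sit in the queue uncounted until the next push. Under that assumption, a minimal sketch is to drop the flag and let the queue itself be the condition, with a condition variable instead of a spinning loop. MessageQueue is a hypothetical class, not part of NNG or nngpp.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical replacement for the flag + queue pair: the queue itself is the
// source of truth, so no message can hide behind an already cleared flag.
class MessageQueue {
public:
    // Called from the receive callback.
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m_mx);
            m_queue.push(std::move(msg));
        }
        m_cv.notify_one();
    }

    // Called from the consumer thread: sleeps until at least one message is
    // queued, then returns everything that is currently queued.
    std::vector<std::string> wait_and_drain() {
        std::unique_lock<std::mutex> lk(m_mx);
        m_cv.wait(lk, [this] { return !m_queue.empty(); });
        std::vector<std::string> out;
        while (!m_queue.empty()) {
            out.push_back(std::move(m_queue.front()));
            m_queue.pop();
        }
        return out;
    }

private:
    std::mutex m_mx;
    std::condition_variable m_cv;
    std::queue<std::string> m_queue;
};
```

The listener thread would then loop on wait_and_drain() and add the size of the returned vector to msgCounter, holding the lock only while moving messages out rather than while printing.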

How can I parallelize transmission and reception of UDP packets in an object

What I have to do is a class which basically has 3 methods :
send(message, dst) which keeps sending messages (and maybe add delay with a sleep(t) and an increasing t) to dst until receiving an ACK.
listen() which receives messages and delivers ACKs. If the message is an ACK, it destroys the thread that sent the acknowledged message.
shutdown() which stops every communication (every thread) and writes a log.
For the ACK mechanism, I thought of tables :
host_map[port][ipadress][id] // Also used for other things which require (port,adress) =>id mapping and also because all host have ids.
ACK[id][message][thread_to_stop] // Will be used to destroy threads except I didn't know how to put infos about the thread here and don't know where to put a join() if I even have to
A vector of threads (or threads id, idk) to stop all the threads when I call shutdown().
I want send() and listen() to be parallelized. In other words, listen() should not block the program so it should be in a thread. Also it needs to keep receiving stuff so it would need a while loop.
In my main.cpp :
A link(my_ip, port);
link.listen();
for (int i = 0; i < 5; i++)
link.send(std::to_string(i), dst_ip, dst_port);
This is supposed to create 5 threads, each with a while loop that sends i, sleeps a little, and repeats until I receive an ACK.
I am new to C++ and never did multi-threading before so I don't even know if I can do this and don't even know where to put my join() if there is any.
Another thing that I thought and don't know if it's possible is to have a queue inside my class Link which keeps sending stuff and have send(msg) just adding it to the queue.
Here is something I made in A.hpp.
void sendm(std::string m, in_addr_t dst_ip, unsigned short dst_port){
int id = resolveId(dst_port, dst_ip);
int seq_number = resolveSeq(); // TODO
std::thread th(&A::sendMessage, this, m, dst_ip, dst_port);
// I need something to be able to add th or an ID in my ACK table and thread vector.
th.join();
}
void sendMessage(std::string m, in_addr_t dst_ip, unsigned short dst_port){
//dst
struct in_addr dst_ipt;
struct sockaddr_in dst;
dst_ipt.s_addr = dst_ip;
dst.sin_family = AF_INET;
dst.sin_port = htons(dst_port);
dst.sin_addr = dst_ipt;
int t = 50;
while(true){
std::cout << "send message m = " << m << "\n";
//Sends a message to dst_ip through dst_port and increments the number of messages sent.
if (sendto(obj_socket, m.c_str(), m.size(), 0, reinterpret_cast<const sockaddr*>(&dst), sizeof(dst)) < 0){
std::cerr << " Error sendto\n";
exit(EXIT_FAILURE);
};
std::cout<< "message sent\n";
std::this_thread::sleep_for(std::chrono::milliseconds(t));
t+=10;
}
}
void receive(){
char buffer[1500];
sockaddr_in from;
socklen_t fromlen = sizeof(from);
ssize_t tmp = recvfrom(obj_socket, buffer, 1500, 0, reinterpret_cast<sockaddr*>(&from), &fromlen);
if (tmp < 0){
std::cout << "Exit\n";
std::cerr << "Error receive from\n";
exit(EXIT_FAILURE);
} else{
// Build the message from the bytes actually received.
std::string m(buffer, static_cast<size_t>(tmp));
int id = resolveId(from.sin_port, from.sin_addr.s_addr);
if (!verifySomething(m,id)){
doSomethingCool(m,id);
}
}
}
My listen() would just be a threaded version of while(true)receive();
Idk if this version compiles to be honest. I keep changing it every 2 minutes. Without the while loop in the send() and the threading, it works so far.
I didn't really implement the ACK mechanism yet.
Thank you for reading and for your help.
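The "keep sending until an ACK arrives" part can be sketched without killing threads from the outside: each sender owns a flag that listen() sets, and the sender sleeps on a condition variable so it wakes up as soon as the ACK comes in. This is only a sketch under assumptions: Retransmitter is a hypothetical class, and send_once stands in for the sendto() call, so there is no socket code here.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical helper: resends one message with a growing delay until acked.
class Retransmitter {
public:
    // `send_once` stands in for the sendto() call and runs on the worker thread.
    explicit Retransmitter(std::function<void()> send_once)
        : m_thread([this, send_once] { run(send_once); }) {}

    // Destroying the object stops the loop and joins the thread exactly once.
    ~Retransmitter() {
        ack();
        m_thread.join();
    }

    Retransmitter(const Retransmitter&) = delete;
    Retransmitter& operator=(const Retransmitter&) = delete;

    // Called by the listener thread when the matching ACK arrives.
    void ack() {
        {
            std::lock_guard<std::mutex> lk(m_mx);
            m_acked = true;
        }
        m_cv.notify_all();
    }

    // Number of sends so far, e.g. for the log written at shutdown.
    int attempts() const { return m_attempts.load(); }

private:
    void run(const std::function<void()>& send_once) {
        auto delay = std::chrono::milliseconds(50);
        std::unique_lock<std::mutex> lk(m_mx);
        while (!m_acked) {
            lk.unlock();
            send_once();
            ++m_attempts;
            lk.lock();
            // Sleep for `delay`, but wake immediately when ack() is called.
            m_cv.wait_for(lk, delay, [this] { return m_acked; });
            delay += std::chrono::milliseconds(10);
        }
    }

    std::mutex m_mx;
    std::condition_variable m_cv;
    bool m_acked = false;
    std::atomic<int> m_attempts{0};
    std::thread m_thread;  // declared last so the other members exist before it starts
};
```

With something like this, the ACK table could map (id, sequence number) to a Retransmitter, listen() would call ack() on the matching entry, and shutdown() would simply destroy all entries, which is where the join() ends up. Note that send() must not join right after creating the thread, otherwise it blocks until the ACK arrives.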

Capture value from infinite thread c++

I have created a thread that runs in parallel with the main thread. Both threads do something forever (both have a while(true) statement). The main thread's while(true) runs the game logic frame by frame, and the second thread receives messages from a socket.
Is it possible to get the string value of a message received by the second thread into the main thread each frame without returning from the second thread?
In C#, I would do it with a method invoker, but I didn't find anything helpful for C++. Is it possible to do this in C++?
Function which creates thread:
void ReceiveMessage() {
//std::promise<int> p;
//auto f = p.get_future();
char buf[1024];
string usernput;
int bytesReceived = 0;
std::thread receiveMessage(&FactoredThread::ThreadFunction, *this);
receiveMessage.detach();
//pokusajporuke = f.get();
}
ThreadFunction:
void ThreadFunction() {
bytesReceived = 0;
while (true) {
bytesReceived = recv(sock, buf, 1024, 0);
if (bytesReceived > 0) {
string primljeniString = "";
for (int i = 0; i < sizeof(buf); i++) {
if (buf[i] != 0)
{
primljeniString += buf[i];
}
}
ZeroMemory(buf, 1024);
pokusajporuke = primljeniString;
}
}
}
So how to get "pokusajporuke" string for main thread?
Yes, sure. There are many ways of solving this problem.
One way is to use signals and slots, like in Qt. For pure C++ you could use Boost.Signals2, which is thread safe.
Or you can implement the producer-consumer pattern: one thread (the producer) puts values into a thread-safe buffer, and the second thread (the consumer) takes them from there.
I think, for your problem second way is better.
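A minimal sketch of that second approach, assuming a plain mutex-protected queue is good enough for one message stream. MessageBuffer is a hypothetical name. The receiving thread calls push() for every message, and the game loop calls try_pop() once per frame, which never blocks.

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical thread-safe buffer shared by the socket thread and the game loop.
class MessageBuffer {
public:
    // Producer side: called by the receiving thread for every message.
    void push(std::string msg) {
        std::lock_guard<std::mutex> lk(m_mx);
        m_queue.push(std::move(msg));
    }

    // Consumer side: called once per frame; never waits for data.
    // Returns false and leaves `out` untouched when nothing has arrived.
    bool try_pop(std::string& out) {
        std::lock_guard<std::mutex> lk(m_mx);
        if (m_queue.empty()) return false;
        out = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }

private:
    std::mutex m_mx;
    std::queue<std::string> m_queue;
};
```

In ThreadFunction the assignment to pokusajporuke would become a push(), and the main loop would call try_pop() at the start of each frame. The buffer has to be the same object in both threads, so it should be passed by pointer or reference rather than copied into the thread.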
Actually, what I needed was a global static variable. Then, with another method called from the main thread, I copy that global variable into a class property.

Windows C++ Intermittent Socket Disconnect

I've got a server that uses a two thread system to manage between 100 and 200 concurrent connections. It uses TCP sockets, as packet delivery guarantee is important (it's a communication system where missed remote API calls could FUBAR a client).
I've implemented a custom protocol layer to separate incoming bytes into packets and dispatch them properly (the library is included below). I realize the issues of using MSG_PEEK, but to my knowledge, it is the only system that will fulfill the needs of the library implementation. I am open to suggestions, especially if it could be part of the problem.
Basically, the problem is that, randomly, the server will drop the client's socket due to a lack of incoming packets for more than 20 seconds, despite the client successfully sending a keepalive packet every 4. I can verify that the server itself didn't go offline and that the connection of the users (including myself) experiencing the problem is stable.
The library for sending/receiving is here:
short ncsocket::send(wstring command, wstring data) {
wstringstream ss;
int datalen = ((int)command.length() * 2) + ((int)data.length() * 2) + 12;
ss << zero_pad_int(datalen) << L"|" << command << L"|" << data;
int tosend = datalen;
short __rc = 0;
do{
int res = ::send(this->sock, (const char*)ss.str().c_str(), datalen, NULL);
if (res != SOCKET_ERROR)
tosend -= res;
else
return FALSE;
__rc++;
Sleep(10);
} while (tosend != 0 && __rc < 10);
if (tosend == 0)
return TRUE;
return FALSE;
}
short ncsocket::recv(netcommand& nc) {
vector<wchar_t> buffer(BUFFER_SIZE);
int recvd = ::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, MSG_PEEK);
if (recvd > 0) {
if (recvd > 8) {
wchar_t* lenstr = new wchar_t[4];
memcpy(lenstr, buffer.data(), 8);
int fulllen = _wtoi(lenstr);
delete lenstr;
if (fulllen > 0) {
if (recvd >= fulllen) {
buffer.resize(fulllen / 2);
recvd = ::recv(this->sock, (char*)buffer.data(), fulllen, NULL);
if (recvd >= fulllen) {
buffer.resize(buffer.size() + 2);
buffer.push_back((char)L'\0');
vector<wstring> data = parsewstring(L"|", buffer.data(), 2);
if (data.size() == 3) {
nc.command = data[1];
nc.payload = data[2];
return TRUE;
}
else
return FALSE;
}
else
return FALSE;
}
else
return FALSE;
}
else {
::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, NULL);
return FALSE;
}
}
else
return FALSE;
}
else
return FALSE;
}
This is the code for determining if too much time has passed:
if ((int)difftime(time(0), regusrs[i].last_recvd) > SERVER_TIMEOUT) {
regusrs[i].sock.end();
regusrs[i].is_valid = FALSE;
send_to_all(L"removeuser", regusrs[i].server_user_id);
wstringstream log_entry;
log_entry << regusrs[i].firstname << L" " << regusrs[i].lastname << L" (suid:" << regusrs[i].server_user_id << L",p:" << regusrs[i].parent << L",pid:" << regusrs[i].parentid << L") was disconnected due to idle";
write_to_log_file(server_log, log_entry.str());
}
The "regusrs[i]" is the currently iterated member of a vector I use to store socket descriptors and user information. The 'is_valid' check is there to tell if the associated user is an actual user. This is done so the system never has to deallocate a member of the vector; it just returns it to the pool of available slots. No thread access/out-of-range issues that way.
Anyway, I started to wonder if the server itself was the problem. I'm testing on another server currently, but I wanted to see if another set of eyes could spot something out of place or clue me in on a concept with sockets and extended keepalives that I'm not aware of.
Thanks in advance!
I think I see what you're doing with MSG_PEEK, where you wait until it looks like you have enough data to read a full packet. However, I would be suspicious of this. (It's hard to determine the dynamic behaviour of your system just by looking at this small part of the source and not the whole thing.)
To avoid use of MSG_PEEK, follow these two principles:
When you get a notification that data is ready (I assume you're using select), then read all the waiting data from recv(). You may use more than one recv() call, so you can handle the incoming data in pieces.
If you read only a partial packet (length or payload), then save it somewhere for the next time you get a read notification. Put the packets and payloads back together yourself, don't leave them in the socket buffer.
As an aside, the use of new/memcpy/_wtoi/delete is woefully inefficient. You don't need to allocate memory at all, you can use a local variable. And then you don't even need the memcpy at all, just a cast.
I presume you already assume that your packets can be no longer than 999 bytes in length.
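The two principles above can be sketched as a small reassembly buffer that sits between recv() and the packet parser. The framing here is a simplification chosen for illustration and is not the poster's wide-character format: each packet is assumed to start with 4 ASCII digits giving the payload length. PacketAssembler and kHeaderLen are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed framing for this sketch: 4 ASCII digits of payload length, then payload.
constexpr std::size_t kHeaderLen = 4;

class PacketAssembler {
public:
    // Append whatever recv() returned, however small or large.
    void feed(const char* data, std::size_t len) { m_pending.append(data, len); }

    // Remove and return every complete payload currently buffered; a trailing
    // partial packet stays buffered until more bytes arrive.
    // The header is not validated here; real code should reject non-digits.
    std::vector<std::string> take_packets() {
        std::vector<std::string> out;
        std::size_t pos = 0;
        while (m_pending.size() - pos >= kHeaderLen) {
            std::size_t payload_len = 0;
            for (std::size_t i = 0; i < kHeaderLen; ++i) {
                payload_len = payload_len * 10 +
                              static_cast<std::size_t>(m_pending[pos + i] - '0');
            }
            if (m_pending.size() - pos - kHeaderLen < payload_len) break;  // incomplete
            out.push_back(m_pending.substr(pos + kHeaderLen, payload_len));
            pos += kHeaderLen + payload_len;
        }
        m_pending.erase(0, pos);  // keep only the unfinished tail
        return out;
    }

private:
    std::string m_pending;
};
```

Each connection would own one such buffer. After every read notification the server calls recv() without MSG_PEEK, passes the bytes to feed(), and dispatches whatever take_packets() returns. Receiving any bytes at all is also a reasonable moment to refresh last_recvd.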

strange behavior in concurrently executing a function for objects in queue

My program has a shared queue, and is largely divided into two parts:
one for pushing instances of class request to the queue, and the other accessing multiple request objects in the queue and processing these objects. request is a very simple class(just for test) with a string req field.
I am working on the second part, and in doing so, I want to keep one scheduling thread, and multiple (in my example, two) executing threads.
The reason I want to have a separate scheduling thread is to reduce the number of lock and unlock operation to access the queue by multiple executing threads.
I am using pthread library, and my scheduling and executing function look like the following:
void * sched(void* elem) {
queue<request> *qr = static_cast<queue<request>*>(elem);
pthread_t pt1, pt2;
if(pthread_mutex_lock(&mut) == 0) {
if(!qr->empty()) {
int result1 = pthread_create(&pt1, NULL, execQueue, &(qr->front()));
if (result1 != 0) cout << "error sched1" << endl;
qr->pop();
}
if(!qr->empty()) {
int result2 = pthread_create(&pt2, NULL, execQueue, &(qr->front()));
if (result2 != 0) cout << "error sched2" << endl;
qr->pop();
}
pthread_join(pt1, NULL);
pthread_join(pt2, NULL);
pthread_mutex_unlock(&mut);
}
return 0;
}
void * execQueue(void* elem) {
request *r = static_cast<request*>(elem);
cout << "req is: " << r->req << endl; // req is a string field
return 0;
}
Simply, each of execQueue has one thread to be executed on, and just outputs a request passed to it through void* elem parameter.
sched is called in main(), with a thread, (in case you're wondering how, it is called in main() like below)
pthread_t schedpt;
int schresult = pthread_create(&schedpt, NULL, sched, &q);
if (schresult != 0) cout << "error sch" << endl;
pthread_join(schedpt, NULL);
and the sched function itself creates multiple (two here) executing threads, pops requests from the queue, and executes the requests by calling execQueue on multiple threads (pthread_create and then pthread_join).
The problem is the weird behavior by the program.
When I checked the size and the elements in the queue without creating threads and calling them on multiple threads, they were exactly what I expected.
However, when I ran the program with multiple threads, it prints out
1 items are in the queue.
2 items are in the queue.
req is:
req is: FIRST! �(x'�j|1��rj|p�rj|1����FIRST!�'�j|!�'�j|�'�j| P��(�(��(1���i|p��i|
with the last line constantly varying.
The desired output is
1 items are in the queue.
2 items are in the queue.
req is: FIRST
req is: FIRST
I guess either the way I call the execQueue on multiple threads, or the way I pop() is wrong, but I could not figure out the problem, nor could I find any source to refer to for a correct usage.
Please help me on this. Bear with me for clumsy use of pthread, as I am a beginner.
Your queue holds objects, not pointers to objects. You can address the object at the front of the queue via operator &() as you are, but as soon as you pop the queue that object is gone and that address is no longer valid. Of course, sched doesn't care, but the execQueue function you sent that address to certainly does.
The most immediate fix for your code is this:
Change this:
pthread_create(&pt1, NULL, execQueue, &(qr->front()));
To this:
// send a dynamic *copy* of the front queue node to the thread
pthread_create(&pt1, NULL, execQueue, new request(qr->front()));
And your thread proc should be changed to this:
void * execQueue(void* elem)
{
request *r = static_cast<request*>(elem);
cout << "req is: " << r->req << endl; // req is a string field
delete r;
return nullptr;
}
That said, I can think of better ways to do this, but this should address your immediate problem, assuming your request object class is copy-constructible, and if it has dynamic members, follows the Rule Of Three.
And here's your mildly sanitized c++11 version just because I needed a simple test thingie for MSVC2013 installation :)
See it Live On Coliru
#include <iostream>
#include <thread>
#include <future>
#include <mutex>
#include <queue>
#include <string>
struct request { std::string req; };
std::queue<request> q;
std::mutex queue_mutex;
void execQueue(request r) {
std::cout << "req is: " << r.req << std::endl; // req is a string field
}
bool sched(std::queue<request>& qr) {
std::thread pt1, pt2;
{
std::lock_guard<std::mutex> lk(queue_mutex);
if (!qr.empty()) {
pt1 = std::thread(&execQueue, std::move(qr.front()));
qr.pop();
}
if (!qr.empty()) {
pt2 = std::thread(&execQueue, std::move(qr.front()));
qr.pop();
}
}
if (pt1.joinable()) pt1.join();
if (pt2.joinable()) pt2.join();
return true;
}
int main()
{
auto fut = std::async(sched, std::ref(q));
if (!fut.get())
std::cout << "error" << std::endl;
}
Of course it doesn't actually do much now (because there are no tasks in the queue).