getaddrinfo() fails continuously with EAI_AGAIN - c++

In my code I am using the following retry loop.
do
{
    r = getaddrinfo(host, service, &hints, ret);
} while (r == EAI_AGAIN);
When testing, getaddrinfo() fails continuously with EAI_AGAIN, so the loop never terminates.
Do you see any way to improve the code? Can we use a counter to bound the number of times it loops?
Also, please let me know all the reasons getaddrinfo() can return EAI_AGAIN.

Here is, admittedly, a wild guess.
We're also seeing this on a slightly underpowered single core embedded system.
I assume the DNS resolver (in our case dnsmasq) is running in a separate process, and for whatever reason (probably because we're running around in circles chasing our tails) it doesn't get enough resources (CPU/RAM/...) to do its job.
A wild guess at a solution might be to put a sleep into that tight loop and let the DNS caching magic get at the resources it needs to do its work.
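For the record, EAI_AGAIN just means a temporary failure in name resolution - typically the resolver timing out or being starved, as above. Something like this bounded loop with a sleep is what I have in mind (a sketch; the retry count and delay are arbitrary numbers to tune, and it needs <netdb.h> and <unistd.h>):

struct addrinfo *res = NULL;
int r = EAI_AGAIN;
for (int attempt = 0; attempt < 5 && r == EAI_AGAIN; ++attempt)
{
    if (attempt > 0)
        usleep(100 * 1000); // 100 ms: give the resolver a chance to run
    r = getaddrinfo(host, service, &hints, &res);
}
if (r != 0)
{
    // still failing: gai_strerror(r) describes the error; give up or report
}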
I will let you know if it works.

Internal error when using MPI Intel library with reduction operation on communicators

I am having some issues when using reduction operations on MPI communicators.
I have lots of different communicators, created this way:
MPI_ERR_SONDAGE(MPI_Group_incl(world_group, comm_size, &(on_going_communicator[0]), &local_group));
MPI_ERR_SONDAGE(MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator)); tag++;
When I call a reduction operation like so:
MPI_ERR_SONDAGE(MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)lignes.size(), MPI_DOUBLE, MPI_MAX, communicator));
I get
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2ace34033c8c]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2ace33aaffe1]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x24f609) [0x2ace337c6609]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x19b518) [0x2ace33712518]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x1686aa) [0x2ace336df6aa]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x251ac7) [0x2ace337c8ac7]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(PMPI_Allreduce+0x562) [0x2ace33685712]
I only have this problem on big test cases, meaning lots of communicators, each with a reasonable amount of data to reduce. So I cannot create an MCVE, sorry.
When I set the environment variables I_MPI_COLL_DIRECT=off and I_MPI_COLL_INTRANODE=pt2pt, the code works fine. I guess the problem is induced by the NUMA-aware shared-memory collectives, and forcing point-to-point communication inhibits them.
But my fear is that these options will lead to degraded performance, so I would really like to understand the underlying problem.
I have tried with:
intel/2020.1.217
intel/2020.2.254
intel/2021.4.0
And they basically show the same error.
Could you tell me, or give me a hint about, what is going on?
Thank you.
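For reference, the overall pattern boils down to something like this sketch (simplified, with the MPI_ERR_SONDAGE error-checking macro omitted; this toy case is far too small to trigger the assertion, which only appears on big runs):

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    // Ranks that belong to this sub-communicator (the on_going_communicator list).
    std::vector<int> members;
    members.push_back(0);
    members.push_back(1);

    int tag = 0;
    bool in_group = (rank == 0 || rank == 1);
    if (in_group) // MPI_Comm_create_group is collective over the group only
    {
        MPI_Group local_group;
        MPI_Group_incl(world_group, (int)members.size(), &(members[0]), &local_group);

        MPI_Comm communicator;
        MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator);
        tag++;

        std::vector<double> temporary(1024, 1.0), temporary_glo(1024, 0.0);
        MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)temporary.size(),
                      MPI_DOUBLE, MPI_MAX, communicator);

        MPI_Comm_free(&communicator);
        MPI_Group_free(&local_group);
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}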

linux openssl simple client non-blocking bio over already existing socket, buffer, and epoll abstraction

Is there simple example code that shows how to create a non-blocking network-based bio from scratch? I do not need to verify right now or renegotiate, just get data flowing back and forth first.
I'm trying to use openssl on top of an already existing abstraction of epoll, sockets, and buffers. I already have existing machinery for all of that and am trying to create a BIO over it, but I cannot for the life of me get it to work. I already have an established TCP connection, so I need to insert that into a source/sink bio and then do the handshake.
The current state is the dreaded scenario where SSL_connect returns -1, SSL_get_error returns 5 (syscall error), and errno reads SUCCESS. I have seen others have the same problem, but not a single answer.
The reason for doing this instead of just using a mem BIO to shuttle bytes back and forth is that the rest of the stack is fairly well optimized, and I don't want to do the extra copying.
My first idea was just to implement a BIO over the in and out buffers I already maintain, but I cannot get that to work either. There are a lot of questions floating around this site and others. Some have outdated answers, and most don't have answers that work, when they have answers at all.
Well, interesting. I would have made this a comment but it's too long.
I think this might boil down to the version of OpenSSL you are using. I took my existing blocking socket implementation and did this (sorry, it's a bit quick and dirty and doesn't bother to clear the error stack as it should, but that doesn't seem to be causing a problem. The busy loop is intentional, to hammer OpenSSL as hard as possible, but I also tried it with a short sleep and the result was the same):
u_long non_blocking = 1;
int ioctl_err = ioctlsocket (skt, FIONBIO, &non_blocking);
assert (ioctl_err == 0);
int connect_result = 0;
int connect_err = 0;
for ( ; ; )
{
    connect_result = SSL_connect (ssl);
    if (connect_result >= 0)
        break;
    connect_err = SSL_get_error (ssl, connect_result);
    if (connect_err != SSL_ERROR_WANT_READ && connect_err != SSL_ERROR_WANT_WRITE)
        break; // I put a breakpoint here; it was never hit
}
non_blocking = 0; // back to blocking, so that the rest of my code still works
ioctl_err = ioctlsocket (skt, FIONBIO, &non_blocking);
if (connect_result <= 0) // this never happened either
    do_something_appropriate ();
// Various (synchronous) calls to `SSL_read` and `SSL_write`; these all worked fine
Now I know this isn't exactly what you are doing, but from OpenSSL's point of view that shouldn't matter. So, for me, I can get it to work.
Testing environment:
Windows (sorry!)
OpenSSL 3.0.1 (which, if not the latest, is not far off)
tl;dr If you're using the OpenSSL libraries that came with your Linux installation, it might be time to move on. Plus, if you build it from source you can build it with debug info, which might come in handy some time (I've actually built two versions for exactly this reason - one optimised and one not).
So that's it. HTH. Looks like you might be OK after all.
PS: I'm obviously not doing anything with BIOs here, and maybe your problem lies there instead. If so, we need some self-contained sample code using those which exhibits the problems you are having. Then, perhaps, someone can suggest a solution.
Define BIO method:
BIO_meth_new
BIO_meth_set_write_ex
BIO_meth_set_read_ex
BIO_meth_set_ctrl
BIO_meth_free
Create BIO:
BIO_new
BIO_set_data
BIO_free
Associate BIO with SSL:
SSL_set_bio
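Put together, a minimal sketch might look like the following (OpenSSL 1.1.0 or later; Conn, conn_pull, and conn_push are hypothetical stand-ins for your existing buffer layer, not real APIs). The retry flags and the FLUSH ctrl are the critical parts: a BIO that returns 0 from read/write without setting a retry flag, or that fails BIO_CTRL_FLUSH, surfaces as exactly the SSL_ERROR_SYSCALL-with-errno-0 symptom described in the question:

#include <openssl/bio.h>
#include <openssl/ssl.h>

struct Conn; // your epoll/socket/buffer machinery
size_t conn_pull(Conn *c, char *dst, size_t len);       // drain your rx buffer
size_t conn_push(Conn *c, const char *src, size_t len); // fill your tx buffer

static int conn_bio_read_ex(BIO *b, char *out, size_t len, size_t *readbytes)
{
    Conn *c = static_cast<Conn *>(BIO_get_data(b));
    size_t n = conn_pull(c, out, len);
    if (n == 0) {                   // nothing buffered yet
        BIO_clear_retry_flags(b);
        BIO_set_retry_read(b);      // surfaces as SSL_ERROR_WANT_READ
        return 0;
    }
    *readbytes = n;
    return 1;
}

static int conn_bio_write_ex(BIO *b, const char *in, size_t len, size_t *written)
{
    Conn *c = static_cast<Conn *>(BIO_get_data(b));
    size_t n = conn_push(c, in, len);
    if (n == 0) {                   // tx buffer full
        BIO_clear_retry_flags(b);
        BIO_set_retry_write(b);     // surfaces as SSL_ERROR_WANT_WRITE
        return 0;
    }
    *written = n;
    return 1;
}

static long conn_bio_ctrl(BIO *, int cmd, long, void *)
{
    // Claiming success for FLUSH matters: a BIO that fails it makes
    // SSL_write report an error that looks like a syscall failure.
    return (cmd == BIO_CTRL_FLUSH) ? 1 : 0;
}

SSL *wrap_connection(SSL_CTX *ctx, Conn *conn)
{
    // In real code, create the method once and reuse it.
    BIO_METHOD *method = BIO_meth_new(BIO_get_new_index() | BIO_TYPE_SOURCE_SINK, "conn-bio");
    BIO_meth_set_read_ex(method, conn_bio_read_ex);
    BIO_meth_set_write_ex(method, conn_bio_write_ex);
    BIO_meth_set_ctrl(method, conn_bio_ctrl);

    BIO *bio = BIO_new(method);
    BIO_set_data(bio, conn);
    BIO_set_init(bio, 1);           // without this, BIO_read/BIO_write fail
    SSL *ssl = SSL_new(ctx);
    SSL_set_bio(ssl, bio, bio);     // SSL takes ownership of the BIO
    return ssl;
}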

I need help figuring out tcp sockets (clsocket)

I am having trouble figuring out sockets. I am just asking the server for data at a position (glm::i64vec4) and expecting a response, but the position gets way off by the time I get the response, and the data for that position reflects that (aka my voxel game makes a kinda cool-looking but useless mess).
It's probably just me not understanding sockets whatsoever, or maybe something weird with this library.
One thought I had is that it was maybe something to do with mismatched blocking and non-blocking modes on the server and client,
but when I switched the server to blocking (and put each client in a separate thread from each other and the accepting process) it did nothing.
If I'm doing something really stupid, please tell me; I know next to nothing about sockets.
Here is some code that probably looks horrible.
Server Code
std::deque<CActiveSocket*> clients;
CPassiveSocket socket;
socket.Initialize();
socket.SetNonblocking(); // I'm doing this so I don't need multiple threads for clients
socket.Listen("0.0.0.0", port);
while (1) {
    {
        CActiveSocket* c;
        if ((c = socket.Accept()) != NULL) {
            clients.emplace_back(c);
        }
    }
    for (CActiveSocket*& c : clients) {
        c->Receive(sizeof(glm::i64vec4));
        if (c->GetBytesReceived() == sizeof(glm::i64vec4)) {
            chkpkt chk;
            chk.pos = *(glm::i64vec4*)c->GetData();
            LOOP3D(chksize+2) {
                chk.data(i,j,k).val = chk.pos.y*chksize+j;
                chk.data(i,j,k).id = 0;
            }
            while (c->Send((uint8*)&chk, sizeof(chkpkt)) != sizeof(chkpkt)) {}
        }
    }
}
Client Code
// v is a glm::i64vec4
// fsock is set to Blocking
if (fsock.Send((uint8*)&v, sizeof(glm::i64vec4)))
    if (fsock.Receive(sizeof(chkpkt))) {
        tthread::lock_guard<tthread::fast_mutex> lock(wld->filemut);
        wld->ichks[v] = (*(chkpkt*)fsock.GetData()).data;
        // I tried using the position I get back from the server to set this (instead
        // of v), but that made it so nothing loaded; I checked, and the chunk's
        // position never lines up with what I sent.
    }
Without your complete application code, it's unrealistic to suggest corrections to particular lines.
But it seems like you are using this library. It doesn't matter much if not, because in network programming, sockets' quirks make some problems fairly universal. So here are a few suggestions for the socket portion of your project:
It suffices to have BLOCKING sockets.
A socket read often does not receive the requested number of bytes in a single call. Because of this, you need to call read repeatedly until the full message has been received. For a complete and robust solution, refer to Stevens's readn routine ([Ref.1], page 122).
If you are using exactly the library mentioned above, you can see that your fsock.Receive eventually calls recv. And recv is just a variant of read [Ref.2], so the solutions for both are identical. This pattern might help (a readn sketch follows it):
while(fsock.Receive(sizeof(chkpkt))>0)
{
// ...
}
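For illustration, here is a sketch of that readn idea over a raw descriptor (roughly what Receive wraps underneath):

#include <sys/socket.h>
#include <cerrno>

// Keep calling recv until `len` bytes have arrived, the peer closes,
// or a real error occurs (Stevens's readn, sketched from memory).
ssize_t readn(int fd, void *buf, size_t len)
{
    char *p = static_cast<char *>(buf);
    size_t left = len;
    while (left > 0) {
        ssize_t n = recv(fd, p, left, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted by a signal: just retry
            return -1;      // real error
        }
        if (n == 0)
            break;          // peer closed the connection
        p += n;
        left -= n;
    }
    return (ssize_t)(len - left);
}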
Ref.1: https://mathcs.clarku.edu/~jbreecher/cs280/UNIX%20Network%20Programming(Volume1,3rd).pdf
Ref.2: https://man7.org/linux/man-pages/man2/recv.2.html#DESCRIPTION

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            Data* current_item = channel_cursors[i];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        ACQUIRE_MEMORY_BARRIER;
        if(update_cursor->state_ == NOT_FILLED_YET) {
            continue;
        }

        ::uint32_t update_id = update_cursor->list_id_;
        Data* current_item = channel_cursors[update_id];

        if(current_item->state_ == NOT_FILLED_YET) {
            std::cerr << "This should never print." << std::endl; // it doesn't
            continue;
        }

        log_latency(current_item->tv_sec_, current_item->tv_usec_);
        channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
        update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
    }
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday, it appears that the change in update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            std::size_t idx = i;

            ACQUIRE_MEMORY_BARRIER;
            if(update_cursor->state_ != NOT_FILLED_YET) {
                //std::cerr << "Found via update" << std::endl;
                i--;
                idx = update_cursor->list_id_;
                update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
            }

            Data* current_item = channel_cursors[idx];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            found_an_update = true;

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed, Method 2 seems better here because the ID, being smaller than the data in general, means you don't have to do round-trips to main memory too often (which is taxing).
However, what can actually happen is that you have a line of cache like this:
Line of cache = [ID1, ID2, ID3, ID4, ...]
                  ^           ^
                  client      server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
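For example (a sketch, assuming C++11 and the usual 64-byte x86 cache line; check your CPU's line size):

#include <cstdint>

struct alignas(64) UpdateID // one entry per cache line
{
    volatile int state_;     // NOT_FILLED_YET / FILLED
    std::uint32_t list_id_;
    // alignas pads the struct to 64 bytes, so neighbouring entries never
    // share a line, and a client polling one entry cannot steal the line
    // the server is about to write.
};
static_assert(sizeof(UpdateID) == 64, "entry should fill a cache line");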
Check out the other articles in the series while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer, I think, that could help with your update list :)
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps, which means the following code...
while(true)
{
    if(update_cursor->state_ == NOT_FILLED_YET) {
        continue;
    }
    // ...
is going to hammer the processor. The typical way to handle this kind of producer/consumer task is to use some kind of semaphore to signal the reader that the update list has changed. A search for producer/consumer multithreading should give you a large number of examples. The main idea is to let the thread go to sleep while it waits for update_cursor->state to change, instead of stealing all the CPU cycles.
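A sketch of that idea with a condition variable (in-process only; the shared-memory, cross-process case needs a process-shared primitive such as a pthread condvar initialised with PTHREAD_PROCESS_SHARED, or a named semaphore):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool updated = false;

void producer_publish() // server side, after filling in a node
{
    {
        std::lock_guard<std::mutex> lk(m);
        updated = true;
    }
    cv.notify_one(); // wake the sleeping reader
}

void consumer_wait() // client side, instead of spinning
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, []{ return updated; }); // sleeps until notified
    updated = false;
    // ... drain the update list ...
}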
The answer was tricky to figure out and, to be fair, would have been hard to find with the information I presented, though if anyone had actually compiled the source code I provided they'd have had a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true as far back as I could scroll in my terminal. At the very beginning there was a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

Win32 Overlapped Readfile on COM Port returning ERROR_OPERATION_ABORTED

Ok, one for the SO hive mind...
I have code which has - until today - run just fine on many systems and is deployed at many sites. It involves threads reading and writing data from a serial port.
Trying to check out a new device, my code was swamped with 995 ERROR_OPERATION_ABORTED errors when calling GetOverlappedResult after the ReadFile. Sometimes the read would work; other times I'd get this error. Just ignoring the error and retrying would - amazingly - work without dropping any data. No ClearCommError required.
Here's the snippet.
if (!ReadFile(handle, &c, 1, &read, &olap))
{
    if (GetLastError() != ERROR_IO_PENDING)
    {
        logger().log_api(LOG_ERROR,"ser_rx_char:ReadFile");
        throw Exception("ser_rx_char:ReadFile");
    }
}
WaitForSingleObjectEx(r_event, INFINITE, true); // alertable, so the thread can be closed correctly

if (GetOverlappedResult(handle, &olap, &read, TRUE) != 0)
{
    if (read != 1)
        throw Exception("ser_rx_char: no data");
    logger().log(LOG_VERBOSE,"read char %d ( read = %d) ", c, read);
}
else
{
    DWORD err = GetLastError();
    if (err != 995) // filters out ERROR_OPERATION_ABORTED
    {
        logger().log_api(LOG_ERROR,"ser_rx_char: GetOverlappedResult");
        throw Exception("ser_rx_char:GetOverlappedResult");
    }
}
My first guess is to blame the COM port driver, which I haven't used before (it's a RS422 port on a Blackmagic Decklink, FYI), but that feels like a cop-out.
Oh, and Vista SP1 Business 32-bit, for my sins.
Before I just put this down to "Someone else's problem", does anyone have any ideas of what might cause this?
How are you setting up the OVERLAPPED structure before the ReadFile? I always zero them (other than the hEvent, obviously), which is perhaps part superstition, but I have a feeling it's caused me a problem in the past.
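That is, something like this before each call (a sketch, reusing the question's r_event):

OVERLAPPED olap;
ZeroMemory(&olap, sizeof(olap)); // clears Internal, Offset and OffsetHigh
olap.hEvent = r_event;           // only the event handle survives the zeroing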
I'm afraid blaming the driver (if it's non-MS and not just a tiny tweak from the reference) is not completely unrealistic. To write a COM driver is an incredibly complex thing, and the difficulty with testing it is that every application ever written uses the serial ports and their IOCTLs slightly differently.
Another common problem is not to set the whole port up - for example not calling SetCommTimeouts or SetupComm. I've no idea if you're making this sort of mistake, but I have met people who say they're not using timeouts when they actually mean that they didn't call SetCommTimeouts so they're using them but don't have a notion what they're set to...
This kind of stuff can be murder for 3rd-party COM drivers, because people have often got away with any old crap with the MS driver, and it doesn't always work the same with another device.
In addition to zeroing the OVERLAPPED, you might also check how you're setting olap.hEvent, that is, what your arguments to CreateEvent are. If you're creating an event that's pre-signalled (i.e. the third argument to CreateEvent is TRUE) I would expect an immediate return. Also, don't forget that if you specify manualReset (the second argument to CreateEvent) as FALSE, GetOverlappedResult() will helpfully clear the event for you - which might explain why it works the second time around.
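In other words, check that the event creation looks something like this (a sketch):

HANDLE r_event = CreateEvent(NULL,  // default security attributes
                             TRUE,  // manual reset: GetOverlappedResult
                                    //   won't clear it behind your back
                             FALSE, // initially non-signalled
                             NULL); // unnamed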
Can't really tell from your snippet whether either of these affect you - hope this helps.