Chaining a trailer segment to an indirect mbuf (DPAA2 PMD) - dpdk

My DPDK application must fragment IPv4 packets before appending a trailer and transmitting the fragments with their trailers. Ideally it should do this with zero copies of the packet body.
For each fragment generated during fragmentation, the DPDK IPv4 fragmentation function creates:
One direct mbuf containing a new IPv4 header.
One indirect mbuf that references the original (pre-fragmentation) packet data.
The indirect mbuf is chained to the direct mbuf. As I can't change the original data to insert the trailer, I think I should insert a third segment at the end of the chain, after the indirect mbuf, to contain the trailer data (and that it's OK to do this in DPDK).
This results in an assertion failure as described below.
I'm using an NXP 2160 SoC (DPAA2 PMD and associated drivers). I've tried allocating and chaining the trailer data as follows (N.B. return value checks are omitted for brevity; they all return OK):
struct rte_mbuf *fragment = fragments[i];
struct rte_mbuf *trailer = rte_pktmbuf_alloc(mbufpool_direct);
void *d = rte_pktmbuf_append(trailer, sizeof(some_trailer_data));
rte_memcpy(d, some_trailer_data, sizeof(some_trailer_data));
rte_pktmbuf_chain(fragment, trailer);
This seems to create the mbuf chain I think I need:
0 mbuf 0x17f47e700 next 0x17e512800 segs 3 data_len 20 ind=N (fragment header)
1 mbuf 0x17e512800 next 0x17f47af80 segs 1 data_len 1176 ind=Y (indirect packet payload)
2 mbuf 0x17f47af80 next (nil) segs 1 data_len 10 ind=N (trailer)
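(For reference, a dump like the one above can be produced by walking the chain; here is a minimal sketch, not from the original post, where the "ind" column is approximated with RTE_MBUF_DIRECT(), so an mbuf with an attached external buffer would also show as Y.)
#include <stdio.h>
#include <rte_mbuf.h>
static void dump_chain(struct rte_mbuf *head)
{
    unsigned int i = 0;
    /* walk the segment list and print the fields shown in the dump above */
    for (struct rte_mbuf *m = head; m != NULL; m = m->next, i++)
        printf("%u mbuf %p next %p segs %u data_len %u ind=%c\n",
               i, (void *)m, (void *)m->next, m->nb_segs, m->data_len,
               RTE_MBUF_DIRECT(m) ? 'N' : 'Y');
}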
However, the code panics in the __rte_mbuf_raw_sanity_check function during a subsequent (much later) call to rte_pktmbuf_alloc. This panic only occurs when I append the trailer as shown above. The assertion and backtrace seem to indicate that an mbuf retrieved from the mempool for allocation still has its next field set:
PANIC in __rte_mbuf_raw_sanity_check():
line 569 assert "m->next == ((void *)0)" failed
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000fffff7a9baac in __GI_abort () at abort.c:79
#2 0x0000fffff7cb2be4 in __rte_panic () from /usr/local/lib/aarch64-linux-gnu/librte_eal.so.23
#3 0x0000aaaaaaaac804 in __rte_mbuf_raw_sanity_check (m=0x17f47e700) at /usr/local/include/rte_mbuf.h:569
#4 rte_mbuf_raw_alloc (mp=0x17f817800) at /usr/local/include/rte_mbuf.h:602
#5 0x0000aaaaaaaac9bc in rte_pktmbuf_alloc (mp=0x17f817800) at /usr/local/include/rte_mbuf.h:908
#6 0x0000aaaaaaab06ac in add_trailer (csa_idx=1234, pkt=0x17f478140, cxt=0xaaaaaaae4440 <appcxt+128>)
I'm trying to understand whether my method for adding a trailer is incorrect / unsupported or whether it may be an issue with the DPAA2 PMD.

rte_eth_tx_burst() descriptor/mbuf management guarantees vs. free thresholds

The rte_eth_tx_burst() function is documented as:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
I have a small test program where this doesn't seem to hold true (using the ixgbe driver with an X553 1GbE NIC bound to vfio-pci).
So my program sets up one transmit queue like this:
uint16_t tx_ring_size = 1024-32;
rte_eth_dev_configure(port_id, 0, 1, &port_conf);
r = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &rx_ring_size, &tx_ring_size);
struct rte_eth_txconf txconf = dev_info.default_txconf;
r = rte_eth_tx_queue_setup(port_id, 0, tx_ring_size,
rte_eth_dev_socket_id(port_id), &txconf);
The transmit mbuf packet pool is created like this:
struct rte_mempool *pkt_pool = rte_pktmbuf_pool_create("pkt_pool", 1023, 341, 0,
RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
That way, when sending packets, I run out of TX descriptors before I run out of packet buffers (the program generates packets with just one segment).
My expectation is that when I call rte_eth_tx_burst() in a loop (to send one packet after another), it never fails, since it transparently frees mbufs of already-sent packets.
However, this doesn't happen.
I basically have a transmit loop like this:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
I use the default threshold values, i.e. the queried values are (from dev_info.default_txconf):
tx thresh : 32
tx rs thresh: 32
wthresh : 0
So the main question now is: How hard is rte_eth_tx_burst() supposed to try to free mbuf buffers (and thus descriptors)?
I mean, it could busy loop until the transmission of previously supplied mbufs is completed.
Or it could just quickly check if some descriptors are free again. But if not, just give up.
Related question: Are the default threshold values appropriate for this use case?
So I work around this like this:
for (;;) {
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
if (l == 1) {
break;
} else {
RTE_LOG(ERR, USER1, "cannot send packet\n");
int r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
if (r < 0) {
rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
}
RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
}
}
With that I get output like this:
USER1: cannot send packet
USER1: 1086. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1118. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1150. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 1182. cleaned up 0 descriptors ...
[..]
USER1: cannot send packet
USER1: 1950. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1982. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 2046. cleaned up 32 descriptors ...
Meaning that it frees at most 32 descriptors at a time, and that it doesn't always succeed; but then the next rte_eth_tx_burst() succeeds in freeing some.
Side question: Is there a better, more DPDK-idiomatic way to handle the recycling of mbufs?
When I change the code such that I run out of mbufs before I run out of transmit descriptors (i.e. tx ring created with 1024 descriptors, mbuf pool still has 1023 elements), I have to change the alloc part like this:
struct rte_mbuf *pkt;
do {
pkt = rte_pktmbuf_alloc(args.pkt_pool);
if (!pkt) {
r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
if (r < 0) {
rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
}
RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
}
} while (!pkt);
The output is similar, e.g.:
USER1: 1023. cleaned up 95 descriptors ...
USER1: 1118. cleaned up 32 descriptors ...
USER1: 1150. cleaned up 32 descriptors ...
USER1: 1182. cleaned up 32 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 32 descriptors ...
[..]
That means the freeing of descriptors/mbufs is so 'slow' that it has to busy loop up to 3 times.
Again, is this a valid approach, or are there better dpdk ways to solve this?
Since rte_eth_tx_done_cleanup() might return -ENOTSUP, this may point to the direction that my usage of it might not be the best solution.
Incidentally, even with the ixgbe driver it fails for me when I disable checksum offloads!
Apparently, ixgbe_dev_tx_done_cleanup() then invokes ixgbe_tx_done_cleanup_vec() instead of ixgbe_tx_done_cleanup_full() which unconditionally returns -ENOTSUP:
static int
ixgbe_tx_done_cleanup_vec(struct ixgbe_tx_queue *txq __rte_unused,
uint32_t free_cnt __rte_unused)
{
return -ENOTSUP;
}
Does this make sense?
So perhaps the better strategy is to make sure that there are fewer descriptors than pool elements (e.g. 1024-32 < 1023) and just re-call rte_eth_tx_burst() until it returns one?
That means like this:
for (;;) {
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
if (l == 1) {
break;
} else {
RTE_LOG(ERR, USER1, "%u. cannot send packet - retry\n", i);
}
}
This works, and the output shows again that the descriptors are freed 32 at a time, e.g.:
USER1: 1951. cannot send packet - retry
USER1: 1951. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2047. cannot send packet - retry
USER1: 2047. cannot send packet - retry
I know that I also can use rte_eth_tx_burst() to submit bigger bursts. But I want to get the simple/edge cases right and understand the dpdk semantics, first.
I'm on Fedora 33 and DPDK 20.11.2.
Recommendation/solution: after confirming (with either rte_mempool_list_dump or dpdk-procinfo) that the cause of the issue is indeed TX descriptor exhaustion rather than mbuf exhaustion, please use rte_eth_tx_buffer_flush or change the settings for the TX thresholds.
Explanation:
The mbuf-free behaviour varies across PMDs, and even within the same NIC it differs between PF and VF. The following points help to understand this properly:
An rte_mempool can be created with or without a per-lcore cache.
When created with a cache, the configured mbufs are kept in per-lcore caches, depending upon the available lcores (EAL options) and the cache-size parameter.
When the HW offload DEV_TX_OFFLOAD_MBUF_FAST_FREE is available and enabled, the agreement is that every transmitted mbuf has a reference count of 1 (a sketch of enabling this offload follows this list).
So whenever tx_burst is invoked (whether it succeeds or fails), the threshold levels are checked to decide whether free mbufs/mbuf segments can be pushed back to the pool.
With DEV_TX_OFFLOAD_MBUF_FAST_FREE enabled, the driver blindly puts the elements back into the lcore cache.
Without DEV_TX_OFFLOAD_MBUF_FAST_FREE, the generic approach is used: each mbuf is validated (nb_segs and refcnt are checked) and then pushed back to its mempool.
In either case, either a fixed number of mbufs (32 is, I believe, the default for all PMDs) or however many are actually free is pushed back to the cache or pool.
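For illustration only (not part of the original answer), a minimal sketch of checking for and enabling this offload at configuration time, using the DPDK 20.11 names, might look like this:
/* requires that all mbufs sent on the queue come from one mempool and have refcnt == 1 */
struct rte_eth_dev_info dev_info;
struct rte_eth_conf port_conf = { 0 };
rte_eth_dev_info_get(port_id, &dev_info);
if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_MBUF_FAST_FREE;
rte_eth_dev_configure(port_id, 0, 1, &port_conf);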
Facts:
In the case of the IXGBE VF driver, the option DEV_TX_OFFLOAD_MBUF_FAST_FREE is not available. This means that each time the thresholds are met, every individual mbuf is checked and pushed back to the mempool.
As per the code snippet, rte_eth_dev_configure is configured only for TX, and rte_pktmbuf_pool_create is called with a cache of 341 elements.
The assumption has to be made that there is only one lcore (which runs the alloc-and-tx loop).
Code Snippet-1:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
[Observation] If the pool were indeed running out of mbufs, rte_pktmbuf_alloc should fail before rte_eth_tx_burst does. But failing at 1086 is an interesting phenomenon, because the total number of mbufs created is 1023 and the failure happens after roughly two rounds of 32 mbufs being released back to the mempool. Analyzing the ixgbe driver code, the only place that returns 0 in tx_xmit_pkts is:
/* Only use descriptors that are available */
nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
if (unlikely(nb_pkts == 0))
return 0;
Even though tx_ring_size is set to 992 in the configuration, rte_eth_dev_adjust_nb_rx_tx_desc internally raises *nb_desc to at least desc_lim->nb_min (i.e. max(*nb_desc, desc_lim->nb_min)). Based on the code, the failure is therefore not because there are no free mbufs, but because TX descriptors are low or unavailable.
In all other cases, rte_eth_tx_done_cleanup or rte_eth_tx_buffer_flush actually push any pending descriptors out of the SW PMD for DMA immediately. This internally frees up more descriptors, which makes tx_burst run much more smoothly.
To identify the root cause whenever the DPDK tx_burst API returns 0, either
invoke rte_mempool_list_dump, or
make use of the mempool dump via dpdk-procinfo.
Note: most PMDs amortize the cost of descriptor (PCIe payload) writes by batching at least 4 packets (in the SSE case). Hence, even if DPDK tx_burst returns 1 for a single packet, that packet may not yet have been pushed out of the NIC. To make sure it is, use rte_eth_tx_buffer_flush.
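As an illustration (my sketch, not from the original answer), buffered TX with an explicit flush could look roughly like this; the burst size of 32 and queue 0 are assumptions:
#include <rte_ethdev.h>
#include <rte_malloc.h>
#define TX_BURST 32
struct rte_eth_dev_tx_buffer *txb = rte_zmalloc_socket("tx_buffer",
        RTE_ETH_TX_BUFFER_SIZE(TX_BURST), 0, rte_eth_dev_socket_id(args.port_id));
rte_eth_tx_buffer_init(txb, TX_BURST);
/* queue a packet; it is transmitted automatically once TX_BURST packets are buffered */
rte_eth_tx_buffer(args.port_id, 0, txb, pkt);
/* ... at the end of the loop, or periodically: push out whatever is still buffered */
rte_eth_tx_buffer_flush(args.port_id, 0, txb);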
Say, you invoke rte_eth_tx_burst() to send one small packet (single mbuf, no offloads). Suppose, the driver indeed pushes the packet to the HW. Doing so eats up one descriptor in the ring: the driver "remembers" that this packet mbuf is associated with that descriptor. But the packet is not sent instantly. The HW typically has some means to notify the driver of completions. Just imagine: if the driver checked for completions on every rte_eth_tx_burst() invocation (thus ignoring any thresholds), then calling rte_eth_tx_burst() one more time in a tight loop manner for another packet would likely consume one more descriptor rather than recycle the first one. So, given this fact, I'd not use tight loop when investigating tx_free_thresh semantics. And it shouldn't matter whether you invoke rte_eth_tx_burst() once per a packet or once per a batch of them.
Now, say you have a Tx ring of size N, suppose tx_free_thresh is M, and you have a mempool of size Z. What you do is the following.
First, allocate a burst of N - M - 1 small packets and invoke rte_eth_tx_burst() to send this burst (no offloads; each packet is assumed to eat up one Tx descriptor). Then you wait for some wittingly sufficient (for completions) amount of time and check the number of free objects in the mempool. This figure should read Z - (N - M - 1).
Then you allocate and send one extra packet. Then wait again. This time, the number of spare objects in the mempool should read Z - (N - M).
Finally, you allocate and send one more packet (again!), thus crossing the threshold (the number of spare Tx descriptors becomes less than M). During this invocation of rte_eth_tx_burst(), the driver should detect crossing the threshold and start checking for completions. This should make the driver free (N - M) descriptors (consumed by the two previous rte_eth_tx_burst() invocations), thus clearing up the whole ring. Then the driver proceeds to push the new packet in question to the HW, thus spending one descriptor. You then check the mempool: this should report Z - 1 free objects.
So, the short of it: no loop, just three rte_eth_tx_burst() invocations with sufficient waiting time between them. And you check the spare object count in the mempool after each send operation. Theoretically, this way, you'll be able to understand the corner case semantics. That's the gist of it. However, please keep in mind that the actual behaviour may vary across different vendors / PMDs.
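To make the procedure concrete, here is a hypothetical sketch using the numbers from the question (ring N = 992, tx_free_thresh M = 32, mempool Z = 1023; needs rte_ethdev.h, rte_mbuf.h, rte_mempool.h, rte_cycles.h and stdio.h). Packet construction is omitted and rte_delay_ms() stands in for "wait long enough for completions":
enum { N = 992, M = 32, Z = 1023 };
struct rte_mbuf *burst[N - M - 1];
for (unsigned i = 0; i < N - M - 1; ++i)
    burst[i] = rte_pktmbuf_alloc(args.pkt_pool);   /* + fill in a small frame */
rte_eth_tx_burst(args.port_id, 0, burst, N - M - 1);
rte_delay_ms(100);
printf("free objs: %u\n", rte_mempool_avail_count(args.pkt_pool)); /* expect Z - (N - M - 1) */
struct rte_mbuf *one = rte_pktmbuf_alloc(args.pkt_pool);
rte_eth_tx_burst(args.port_id, 0, &one, 1);
rte_delay_ms(100);
printf("free objs: %u\n", rte_mempool_avail_count(args.pkt_pool)); /* expect Z - (N - M) */
one = rte_pktmbuf_alloc(args.pkt_pool);
rte_eth_tx_burst(args.port_id, 0, &one, 1);        /* crosses the threshold */
rte_delay_ms(100);
printf("free objs: %u\n", rte_mempool_avail_count(args.pkt_pool)); /* expect Z - 1 */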
Relying on rte_eth_tx_done_cleanup() really isn't an option since many PMDs don't implement it. Mostly the Intel PMDs provide it, but e.g. the SFC, MLX* and af_packet ones don't.
However, it's still unclear why the ixgbe PMD doesn't support cleanup when no offloads are enabled.
The requirements on rte_eth_tx_burst() with respect to freeing are really light - from the API docs:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
[..]
* @return
* The number of output packets actually stored in transmit descriptors of
* the transmit ring. The return value can be less than the value of the
* *tx_pkts* parameter when the transmit ring is full or has been filled up.
So just attempting to free (but not waiting on the results of that attempt) and returning 0 (since 0 is less than tx_pkts) is covered by that 'contract'.
FWIW, no example distributed with DPDK loops around rte_eth_tx_burst() to re-submit not-yet-sent packets. There are some examples that use rte_eth_tx_burst() and discard unsent packets, though.
AFAICS, besides rte_eth_tx_done_cleanup() and rte_eth_tx_burst() there is no other function for requesting the release of mbufs previously submitted for transmission.
Thus, it's advisable to size the mbuf packet pool larger than the configured ring size in order to survive situations where all mbufs are inflight and can't be recovered because there is no mbuf left for calling rte_eth_tx_burst() again.

Do I have to wrap L2 packets in mbufs? If yes, then how?

I have L2 packets in form of byte arrays and I would like to send them to the internet using DPDK. The L2 packets are inside a secondary DPDK app and the plan is to send them to a primary DPDK app.
I thought of the following (inside the secondary app):
get an mbuf from the shared mempool
put the L2 packet inside the mbuf
put the mbuf on the shared ring for the primary to take
The primary DPDK app will take them from the shared ring and send them using rte_eth_tx_burst().
My problem is that I am not sure if I should wrap the L2 packets in mbufs or not.
And if I should, then how could I do it?
What I got so far is this:
let mut my_buffer = self.do_rte_mempool_get();
while let Err(er) = my_buffer {
warn!("rte_mempool_get failed, trying again.");
my_buffer = self.do_rte_mempool_get();
// it may fail if not enough entries are available.
}
warn!("rte_mempool_get success");
//STEP ONE UNTIL HERE
// Let's just send an empty packet for starters.
let my_buffer = my_buffer.unwrap();
// STEP TWO MISSING
let mut res = self.do_rte_ring_enqueue(my_buffer);
// it may fail if not enough room in the ring to enqueue
while let Err(er) = res {
warn!("rte_ring_enqueue failed, trying again.");
res = self.do_rte_ring_enqueue(my_buffer);
}
warn!("rte_ring_enqueue success");
// STEP THREE UNTIL HERE
This is Rust code, I created wrappers for the C bindings to DPDK.
There are multiple ways to send the desired packet out of the DPDK interface, such as:
Send the packet from Primary using rte_eth_tx_burst
Send the packet from Secondary using rte_eth_tx_burst
Sharing the packet (byte) array from secondary to primary
Sharing the fully constructed mbuf from secondary to primary
The actual question is "My problem is that I am not sure if I should wrap the L2 packets in mbufs or not. And if I should, then how could I do it?" and not "How to send L2 packets to the internet using DPDK?"
The answer: depending upon the use case, one can send either the byte array or a fully formed mbuf from secondary to primary.
Advantages of sending a byte array to the primary:
no need to allocate an mbuf instance from the mbuf pool
no need to copy to the specific data position inside the mbuf
no need to update mbuf headers
Advantages of sending an mbuf to the primary:
the mbuf can simply be validated and sent out directly via tx_burst
no need to synchronize or use different TX queues
the secondary application can be built with minimal DPDK libraries and with no driver PMD at all
Hence, depending upon the actual intent, the decision can be made. A sketch of the missing "step two" (wrapping the raw bytes in an mbuf) follows.
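As a rough illustration of "step two" in C (the Rust wrapper calls would mirror these; the function and parameter names here are mine, not from the original post):
#include <rte_mbuf.h>
#include <rte_memcpy.h>
static struct rte_mbuf *
wrap_l2_frame(struct rte_mempool *mp, const uint8_t *frame, uint16_t len)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
    if (m == NULL)
        return NULL;
    /* reserve len bytes of data room; fails if len exceeds the tailroom */
    char *dst = rte_pktmbuf_append(m, len);
    if (dst == NULL) {
        rte_pktmbuf_free(m);
        return NULL;
    }
    rte_memcpy(dst, frame, len);  /* pkt_len/data_len were already updated by append */
    return m;
}
The resulting mbuf can then be enqueued on the shared ring and transmitted by the primary with rte_eth_tx_burst().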

C++ udp recvfrom reduce drops

I have a quite standard setup of my UDP receiver socket. My sender sends data at 36 Hz and my receiver reads at 72 Hz, 12072 bytes per send.
When I do cat /proc/net/udp, I usually get:
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
7017: 0101007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 10636 2 0000000000000000 0
7032: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 14671 2 0000000000000000 0
7595: 00000000:0277 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 11113 2 0000000000000000 0
7660: 00000000:22B8 00000000:0000 07 00000000:00004100 00:00000000 00000000 1000 0 251331 3 0000000000000000 352743
You can see the rx_queue column has some value in there; are my reads not fast enough?
My code:
int recv_len = recvfrom(s, buf, BUFLEN, MSG_TRUNC, (struct sockaddr *) &si_other, &slen);
// don't worry, BUFLEN is about 64000, no error here
std::cout <<" recv_len "<<recv_len<<std::endl;
I always get recv_len 12072 as output even though the queue is quite big. Why is this? Is there a way to speed up my reads, or to read all the messages in the queue? I don't understand what's wrong since my read frequency is higher than the send frequency.
UDP datagrams always travel as a complete atomic unit. If you send a 12072 byte UDP datagram, your receiver will get exactly one 12072 byte datagram or nothing at all -- you won't ever receive a partial message (*) or multiple messages concatenated.
Note that with datagrams of this size, they're almost certainly being fragmented at the IP layer because they're probably larger than your network's MTU (maximum transmission unit). In that case, if any one of the fragments is dropped along the way or at the receiving host or found to be corrupted, the entire UDP datagram will be dropped.
(* A message may be truncated if the buffer provided to recvfrom is too small, but it will never even be considered for receiving if the entire message could not be reassembled in the kernel.)
If you are unable to receive all the messages being sent, I would check whether you need to increase the kernel buffer space allocated to UDP. This is done with the sysctl utility. Specifically you should check and possibly adjust the values of net.core.rmem_max and net.ipv4.udp_mem. See the corresponding documentation:
https://www.kernel.org/doc/Documentation/sysctl/net.txt
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
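A related per-socket knob (my addition, not mentioned above) is SO_RCVBUF, which is capped by net.core.rmem_max, so the sysctl may still need raising; a minimal sketch:
#include <stdio.h>
#include <sys/socket.h>
int rcvbuf = 4 * 1024 * 1024;               /* ask for a 4 MB receive buffer */
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
socklen_t optlen = sizeof(rcvbuf);
getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen);
printf("effective SO_RCVBUF: %d\n", rcvbuf); /* the kernel doubles the request and caps it at rmem_max */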
Finally, it seems a bit curious to talk about "read frequency" -- I assume that means you are polling the socket 72 times per second? Why not just dedicate a thread to reading from the socket. Then the thread can block on the recvfrom and the receive will complete with the least possible latency. (In any case, this is worth a try, even if only for a test -- to see if the polling is contributing to your inability to keep up with the sender.)
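A minimal sketch of such a dedicated receiver thread (assuming the socket s from the question; error handling trimmed):
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
static void *rx_thread(void *arg)
{
    int sock = *(int *)arg;
    char buf[65536];
    for (;;) {
        /* blocks until a complete datagram arrives; no polling */
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0)
            break;                 /* handle errors / shutdown here */
        /* hand the datagram off to the consumer (queue, ring buffer, ...) */
    }
    return NULL;
}
/* started once with: pthread_create(&tid, NULL, rx_thread, &s); */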

gdb - identify the reason of the segmentation fault

I have a server/client system that runs well on my machines, but it core dumps on one of the users' machines (OS: CentOS 5). Since I don't have access to the user's machine, I built a debug-mode binary and asked the user to try it. The crash did happen again after around 2 days of running, and he sent me the core dump file. Loading the core dump file with gdb does show the crash location, but I don't understand the reason (sorry, my previous experience is mostly with Windows; I don't have much experience with Linux/gdb). I would like to have your input. Thanks!
1. The /var/log/messages on the user's machine shows the segfault:
Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at 0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4
This message indicates that there is a segfault at instruction pointer 80e6433 and stack pointer f2afd4e0. It looks like the program tries to read/write at address 0.
2. Loading the core dump file into gdb shows the crash location:
$gdb LSystem core.19009
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
... (many lines of outputs from gdb omitted)
Core was generated by `./LSystem'.
Program terminated with signal 11,
Segmentation fault.
#0 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
gdb says the crash occurs at Line 214?
3. Frame information (at frame #0):
(gdb) info frame
Stack level 0, frame at 0xf2afd7e0:
eip = 0x80e6433 in CLClient::connectToServer (liccomm/LClient.cpp:214); saved eip 0x80e6701
called by frame at 0xf2afd820
source language c++.
Arglist at 0xf2afd7d8, args: this=0xf2afd898, conn=11
Locals at 0xf2afd7d8, Previous frame's sp is 0xf2afd7e0
Saved registers:
ebx at 0xf2afd7cc, ebp at 0xf2afd7d8, esi at 0xf2afd7d0, edi at 0xf2afd7d4, eip at 0xf2afd7dc
The frame is at f2afd7e0; why is it different from the rsp in part 1, which is f2afd4e0? I guess the user may have provided me with a mismatched core dump file (whose pid is 19009) and /var/log/messages file (which indicates pid 4688).
4. The source:
(gdb) list +
209
210 //pHost is declared as struct hostent* and 'pHost = gethostbyname(serverAddress);'
211 memset( &a4, 0, sizeof(a4) );
212 a4.sin_family = AF_INET;
213 a4.sin_port = htons( nPort );
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
215
216 aalen = sizeof(a4);
217 aa = (struct sockaddr *)&a4;
I could not see anything wrong with line 214, and this part of the code must have run many times during the 2 days of runtime.
5. The variables
Since gdb indicated that line 214 was the culprit, I printed everything.
memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
(gdb) print a4.sin_addr
$1 = {s_addr = 0}
(gdb) print &(a4.sin_addr)
$2 = (in_addr *) 0xf2afd794
(gdb) print pHost->h_addr_list[0]
$3 = 0xa24af30 "\202}\204\250"
(gdb) print pHost->h_length
$4 = 4
(gdb) print memcpy
$5 = {} 0x2fcf90
So I basically printed everything that's at Line 214. ('pHost->h_addr_list[0]' is 'pHost->h_addr' due to '#define h_addr h_addr_list[0]')
I was not able to catch anything wrong. Did you catch anything fishy? Is it possible the memory has been corrupted somewhere else? I appreciate your help!
[edited] 6. Back trace:
(gdb) bt
#0 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
#1 0x080e6701 in CLClient::connectToLMServer (this=0xf2afd898) at liccomm/LClient.cpp:121
... (Frames 2~7 omitted, not relevant)
#8 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166
#9 0xf7f5fb41 in ?? ()
#10 0xf3563f98 in ?? ()
#11 0xf2aff31c in ?? ()
#12 0x00000000 in ?? ()
I followed the nested calls. They are correct.
The problem with the memcpy is that the source location is not of the same type as the destination.
You should use inet_addr to convert addresses from string to binary:
a4.sin_addr = inet_addr(pHost->h_addr);
The previous code may not work depending on the implementation (some implementations may return struct in_addr, others will return unsigned long), but the principle is the same.

Possible causes of a deadlock in socket select

I have a Jabber server application and a Jabber client application, both in C++.
When the client receives and sends a lot of messages (more than 20 per second), it happens that the select just freezes and never returns.
According to netstat the socket is still connected on Linux, and with tcpdump I can see that messages are still being sent to the client, but the select just never returns.
Here is the code that selects:
bool ConnectionTCPBase::dataAvailable( int timeout )
{
if( m_socket < 0 )
return true; // let recv() catch the closed fd
fd_set fds;
struct timeval tv;
FD_ZERO( &fds );
// the following causes a C4127 warning in VC++ Express 2008 and possibly other versions.
// however, the reason for the warning can't be fixed in gloox.
FD_SET( m_socket, &fds );
tv.tv_sec = timeout / 1000000;
tv.tv_usec = timeout % 1000000;
return ( ( select( m_socket + 1, &fds, 0, 0, timeout == -1 ? 0 : &tv ) > 0 )
&& FD_ISSET( m_socket, &fds ) != 0 );
}
And here is the deadlock as seen in gdb:
Thread 2 (Thread 0x7fe226ac2700 (LWP 10774)):
#0 0x00007fe224711ff3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000004706a9 in gloox::ConnectionTCPBase::dataAvailable (this=0xcaeb60, timeout=<value optimized out>) at connectiontcpbase.cpp:103
#2 0x000000000046c4cb in gloox::ConnectionTCPClient::recv (this=0xcaeb60, timeout=10) at connectiontcpclient.cpp:131
#3 0x0000000000471476 in gloox::ConnectionTLS::recv (this=0xd1a950, timeout=648813712) at connectiontls.cpp:89
#4 0x00000000004324cc in glooxd::C2S::recv (this=0xc5d120, timeout=10) at c2s.cpp:124
#5 0x0000000000435ced in glooxd::C2S::run (this=0xc5d120) at c2s.cpp:75
#6 0x000000000042d789 in CNetwork::run (this=0xc56df0) at src/Network.cpp:343
#7 0x000000000043115f in threading::ThreadManager::threadWorker (data=0xc56e10) at src/ThreadManager.cpp:15
#8 0x00007fe2249bc9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9 0x00007fe22471970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
Do you know what can cause a select to stop receiving messages even though we are still sending to it?
Is there any buffer limit in Linux when receiving and sending a lot of messages through a socket?
Thanks
There are several possibilities.
Exceeding FD_SETSIZE
Your code is checking for a negative file descriptor, but not for exceeding the upper limit, which is FD_SETSIZE (typically 1024). Whenever that happens, your code is
corrupting its own stack
presenting an empty fd_set to the select, which will cause a hang
Supposing that you do not need so many concurrently open file descriptors, the solution would probably consist in finding and removing a file descriptor leak, especially in the code up the stack that handles closing of abandoned descriptors.
There is a suspicious comment in your code that indicates a possible leak:
// let recv() catch the closed fd
If this comment means that somebody sets m_socket to -1 and hopes that a recv will catch the closed socket and close it, who knows, maybe we are closing -1 and not the real closed socket. (Note the difference between closing on network level and closing on file descriptor level which requires a separate close call.)
This could also be treated by moving to poll but there are a few other limits imposed by the operating system that make this route quite challenging.
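For what it's worth, a sketch of the same check using poll() (my rewrite, not gloox code); poll() is not limited by FD_SETSIZE, and the timeout is in microseconds as in dataAvailable():
#include <poll.h>
#include <stdbool.h>
static bool data_available_poll(int sock, int timeout_us)
{
    struct pollfd pfd;
    if (sock < 0)
        return true;              /* let recv() catch the closed fd */
    pfd.fd = sock;
    pfd.events = POLLIN;
    pfd.revents = 0;
    /* convert microseconds to milliseconds; -1 still means "wait forever" */
    return poll(&pfd, 1, timeout_us == -1 ? -1 : timeout_us / 1000) > 0
           && (pfd.revents & POLLIN);
}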
Out of band data
You say that the server is "sending" data. If that means that the data is sent using the send call (as opposed to a write call), use strace to determine the send flags argument. If the MSG_OOB flag is used, the data arrives as out-of-band data - and your select call will not notice it unless you pass a copy of fds as the exceptfds parameter:
fd_set fds_copy = fds;
select( m_socket + 1, &fds, 0, &fds_copy, timeout == -1 ? 0 : &tv )
Process starvation
If the box is heavily overloaded, the server is executing without any blocking calls, and with a real time priority (use top to check on that) - and the client is not - the client might be starved.
Suspended process
The client might theoretically be stopped with a SIGSTOP. You would probably know if this is the case, having pressed somewhere ctrl-Z or having some particular process exercising control on the client other than you starting it yourself.