rte_flow for ARP packets with DPDK

Is there a way, using rte_flow, to send ARP and NDP packets to a specific RX queue with DPDK?
In rte_flow_item_type I don't see an entry for ARP or NDP.
For IPv4 I did it the following way:
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[0].spec = NULL;
pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
pattern[1].spec = NULL;
What do I need to do for ARP and NDP? There is no RTE_FLOW_ITEM_TYPE_ARP.
DPDK version: 19.11
NIC: mlx5, 100G Mellanox card

Since v18.05-rc1, there has been item type RTE_FLOW_ITEM_TYPE_ARP_ETH_IPV4. That being said, it might be unsupported by the PMD in question.
Consider matching on the EtherType field instead:
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_flow.h>
struct rte_flow_item_eth item_eth_mask = {};
struct rte_flow_item_eth item_eth_spec = {};
item_eth_spec.type = RTE_BE16(RTE_ETHER_TYPE_ARP);
item_eth_mask.type = RTE_BE16(0xFFFF);
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[0].mask = &item_eth_mask;
pattern[0].spec = &item_eth_spec;
In what comes to NDP, perhaps it pays to check out RTE_FLOW_ITEM_TYPE_ICMP6_ND_*. Again, these might be unsupported by the PMD in question. If that is the case, consider the use of RTE_FLOW_ITEM_TYPE_ICMP6 to redirect all of ICMPv6 to the dedicated queue.
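For illustration, a minimal sketch of that ICMPv6 fallback, reusing a pattern array and a queue action like in the snippets above (whether the PMD accepts it still has to be confirmed with rte_flow_validate()):
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[0].spec = NULL;
pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV6;
pattern[1].spec = NULL;
pattern[2].type = RTE_FLOW_ITEM_TYPE_ICMP6; /* matches all of ICMPv6, which covers NDP */
pattern[2].spec = NULL;
pattern[3].type = RTE_FLOW_ITEM_TYPE_END;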

With RSS disabled, the default queue for non-matching packets is queue 0 for all vendor PMDs, except for the TAP PMD, where multiple RX queues are supported by the OS.
Since you have mentioned that the DPDK version is 19.11, the best option is to make use of rte_flow_item_eth to filter on the desired EtherType. In DPDK 19.11 that is:
struct rte_flow_item_eth {
struct rte_ether_addr dst; /**< Destination MAC. */
struct rte_ether_addr src; /**< Source MAC. */
rte_be16_t type; /**< EtherType or TPID. */
};
So using the following code snippet you can steer the desired packet type to a given queue:
struct rte_flow_attr attr = { .ingress = 1 };
struct rte_flow_item pattern[10];
struct rte_flow_action actions[10];
struct rte_flow_action_queue actionqueue = { .index = 1 };
struct rte_flow_item_eth eth;
struct rte_flow_item_vlan vlan;
struct rte_flow_item_ipv4 ipv4;
struct rte_flow *flow;
struct rte_flow_error error;
struct rte_flow_item_eth item_eth_mask;
struct rte_flow_item_eth item_eth_spec;
/* memset item_eth_mask and item_eth_spec to 0 */
item_eth_spec.type = RTE_BE16(RTE_ETHER_TYPE_ARP);
item_eth_mask.type = RTE_BE16(0xFFFF);
/* match only the ARP EtherType */
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[0].spec = &item_eth_spec;
pattern[0].last = NULL;
pattern[0].mask = &item_eth_mask;
/* end the pattern array */
pattern[1].type = RTE_FLOW_ITEM_TYPE_END;
actions[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
actions[0].conf = &actionqueue;
actions[1].type = RTE_FLOW_ACTION_TYPE_END;
/* validate and create the flow rule */
if (!rte_flow_validate(port, &attr, pattern, actions, &error))
flow = rte_flow_create(port, &attr, pattern, actions, &error);
else
printf("rte_flow err %s\n", error.message);
Note: this fails on Intel X710 and E810 NICs, as they only support VLAN+IP+UDP|TCP|SCTP+inner patterns.

Related

Checksum calculation changes in DPDK v.19.11?

Since upgrading from DPDK 19.08.2 to 19.11.8, UDP Rx packets are failing the IPv4 checksum calculation. We offload Tx checksum calculation to hardware, but on the Rx side we calculate checksum in software by calling rte_ipv4_cksum().
The NIC is an Intel X722 device.
If both Tx and Rx use DPDK 19.08.2, all is ok and rte_ipv4_cksum() returns 0xFFFF (as I expect).
If Tx uses DPDK 19.08.2 but Rx uses 19.11.8, rte_ipv4_cksum() returns 0 (which we count as a failure).
Could this be a bug or am I misunderstanding the checksum calculation?
I notice there is a difference in the return statement of rte_ipv4_cksum() for the two versions:
In 19.08.2:
static inline uint16_t
rte_ipv4_cksum(const struct rte_ipv4_hdr *ipv4_hdr)
{
uint16_t cksum;
cksum = rte_raw_cksum(ipv4_hdr, sizeof(struct rte_ipv4_hdr));
return (cksum == 0xffff) ? cksum : (uint16_t)~cksum;
}
In 19.11.8:
static inline uint16_t
rte_ipv4_cksum(const struct rte_ipv4_hdr *ipv4_hdr)
{
uint16_t cksum;
cksum = rte_raw_cksum(ipv4_hdr, sizeof(struct rte_ipv4_hdr));
return (uint16_t)~cksum;
}
The reason is that the return value of rte_ipv4_cksum(), for a valid checksum, has changed in DPDK 19.11.8.
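In practice this means an Rx-side check written against 19.08 must compare the result against 0 instead of 0xFFFF after the upgrade. A small sketch that tolerates both conventions, assuming ipv4_hdr points at the received IPv4 header:
uint16_t cksum = rte_ipv4_cksum(ipv4_hdr);
/* 19.08.2 returned 0xFFFF for a valid header, 19.11.x returns 0 */
int cksum_ok = (cksum == 0) || (cksum == 0xFFFF);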

I need a code snippet to embed the Ethernet header with payload from a secondary DPDK application to the primary

The architecture is as follows:
Secondary DPDK app ------> Primary DPDK App ----> (EDIT)Interface
Inside my Secondary I have a vector of u8 bytes representing an L2 packet.
I want to send this L2 packet to the Primary App so the primary could send it to the internet.
From what I understood, the L2 packet has to be wrapped in mbuf in order to be able to put on a shared ring.
But I have no clue on how to do this wrapping.
What I don't know exactly: my packet is just a vector of bytes, how could I extract useful information out of it in order to fill the mbuf fields? And which fields of the mbuf should be filled minimally for this to work?
For better understanding, here is what should happen step by step:
Vector of bytes gets in secondary (doesn't matter how)
Secondary gets an mbuf from the shared mempool.
Secondary puts the vector inside the mbuf (the vector is an L2 packet)
mbuf has many fields representing many things, so I don't know which field to fill and with what.
Secondary places the mbuf on a shared ring.
Primary grabs the mbuf from shared ring.
Primary sends the mbuf to the internet.
This is what I have coded so far; the secondary app is in Rust and the primary app is in C.
Secondary is here (github):
Remember, the L2 packet is just a Vec<u8>, that is, something like [1, 222, 23, 34...], a simple array.
// GETTING THE MBUF FROM SHARED MEMPOOL
let mut my_buffer = self.do_rte_mempool_get();
while let Err(er) = my_buffer {
warn!("rte_mempool_get failed, trying again.");
my_buffer = self.do_rte_mempool_get();
// it may fail if not enough entries are available.
}
warn!("rte_mempool_get success");
// Let's just send an empty packet for starters.
let my_buffer = my_buffer.unwrap();
// HERE I SHOULD PUT THE L2 PACKET INSIDE THE MBUF.
// MY L2 PACKET is a Vec<u8>
// NOW I PUT THE MBUF ON THE SHARED RING, BYE MBUF
let mut res = self.do_rte_ring_enqueue(my_buffer);
// it may fail if not enough room in the ring to enqueue
while let Err(er) = res {
warn!("rte_ring_enqueue failed, trying again.");
res = self.do_rte_ring_enqueue(my_buffer);
}
warn!("rte_ring_enqueue success");
And Primary is here (it just gets mbufs from ring and has to send them with rte_eth_tx_burst()):
/* Run until the application is quit or killed. */
for (;;)
{
// receive packets on rte ring
// then send them to NIC
struct rte_mbuf *bufs[BURST_SIZE];
void *mbuf;
if (rte_ring_dequeue(recv_ring, &mbuf) < 0) {
continue;
}
printf("Received mbuf.\n");
//for now I just want to test it out so I stop here
continue;
//* Send packet to port */
bufs[0] = mbuf;
uint16_t nbPackets = 1;
const uint16_t nb_tx = rte_eth_tx_burst(port, 0,
bufs, nbPackets);
// /* Free any unsent packets. */
if (unlikely(nb_tx < nbPackets))
{
rte_pktmbuf_free(bufs[nb_tx]);
}
If you have any questions please let me know!
As always, thanks for reading!
UPDATE: the dpdk primary wasn't actually connected to the internet. It is simply using an interface of a virtual machine. The DPDK secondary and primary are both running inside a virtual machine and the interface used by primary is connected to a host interface through a bridge. So I can watch the bridge in question on the host using tcpdump.
I tried something to put the L2 packet inside the mbuf on secondary and it looks like this:
(you can also check github)
// After receiving something on the channel
// I want to send it to the primary DPDK
// And the primary will send it to hardware NIC
let mut my_buffer = self.do_rte_mempool_get();
while let Err(er) = my_buffer {
warn!("rte_mempool_get failed, trying again.");
my_buffer = self.do_rte_mempool_get();
// it may fail if not enough entries are available.
}
warn!("rte_mempool_get success");
// Let's just send an empty packet for starters.
let my_buffer = my_buffer.unwrap();
let my_buffer_struct: *mut rte_mbuf = my_buffer as (*mut rte_mbuf);
unsafe {
// the packet buffer, not the mbuf
let buf_addr: *mut c_void = (*my_buffer_struct).buf_addr;
let mut real_buf_addr = buf_addr.offset((*my_buffer_struct).data_off as isize);
//try to copy the Vec<u8> inside the mbuf
copy(my_data.as_mut_ptr(), real_buf_addr as *mut u8, my_data.len());
(*my_buffer_struct).data_len = my_data.len() as u16;
};
(my_data is the Vec<u8> in the above code snippet)
Now, on the primary DPDK I am receiving those bytes which are the bytes of the L2 packet. I printed them and they are the same as in secondary which is great.
for (;;)
{
// receive packets on rte ring
// then send them to NIC
struct rte_mbuf *bufs[BURST_SIZE];
void *mbuf;
unsigned char* my_packet;
uint16_t data_len;
uint16_t i = 0;
if (rte_ring_dequeue(recv_ring, &mbuf) < 0) {
continue;
}
printf("Received mbuf.\n");
my_packet = ((unsigned char *)(*(struct rte_mbuf *)mbuf).buf_addr) + ((struct rte_mbuf *)mbuf)->data_off;
data_len = ((struct rte_mbuf *)mbuf)->data_len;
for (i = 0; i < data_len; i++) {
printf("%d ", (uint8_t)my_packet[i]);
}
printf("\n");
//for now I just want to test it out so I stop here
// rte_pktmbuf_free(mbuf);
// continue;
//* Send packet to port */
bufs[0] = (struct rte_mbuf *)mbuf;
uint16_t nbPackets = 1;
const uint16_t nb_tx = rte_eth_tx_burst(port, 0,
bufs, nbPackets);
// /* Free any unsent packets. */
if (unlikely(nb_tx < nbPackets))
{
rte_pktmbuf_free(bufs[nb_tx]);
}
But the issue is that after sending the mbuf from primary with eth_tx_burst, I cannot see any packet while using tcpdump on the host.
So I am guessing I am not wrapping the packet inside the mbuf properly.
I hope it makes more sense.
@Mihai, if one needs to create a DPDK buffer in the secondary and send it via an rte_ring, the following are the steps to do so:
Start the secondary application
Get the Mbuf pool ptr via rte_mempool_lookup
Allocate mbuf from mbuf pool via rte_pktmbuf_alloc
set minimum fields in mbuf such as pkt_len, data_len, next and nb_segs to appropriate values.
Fetch the start of the data region to memcpy your custom packet into, using rte_pktmbuf_mtod_offset or rte_pktmbuf_mtod
Then memcpy the content from the user vector into the DPDK area
Note: depending on checksum offload, the actual frame length and chained-mbuf mode, other fields need to be updated as well.
code snippet
mbuf_ptr = rte_pktmbuf_alloc(mbuf_pool);
mbuf_ptr->data_len = [size of vector, preferably under 1500];
mbuf_ptr->pkt_len = mbuf_ptr->data_len;
mbuf_ptr->nb_segs = 1;
mbuf_ptr->next = NULL;
struct rte_ether_hdr *eth_hdr = rte_pktmbuf_mtod(mbuf_ptr, struct rte_ether_hdr *);
rte_memcpy(eth_hdr, user_buffer, mbuf_ptr->data_len);
Note: code similar to the above has been implemented in a Rust + C WireGuard port enabled with DPDK.
Please rework your code with the above:
warn!("rte_mempool_get success");
// Let's just send an empty packet for starters.
let my_buffer = my_buffer.unwrap();
let my_buffer_struct: *mut rte_mbuf = my_buffer as (*mut rte_mbuf);
unsafe {
// the packet buffer, not the mbuf
let buf_addr: *mut c_void = (*my_buffer_struct).buf_addr;
let mut real_buf_addr = buf_addr.offset((*my_buffer_struct).data_off as isize);
//try to copy the Vec<u8> inside the mbuf
copy(my_data.as_mut_ptr(), real_buf_addr as *mut u8, my_data.len());
(*my_buffer_struct).data_len = my_data.len() as u16;
};
unsafe {
warn!("Length of segment buffer: {}", (*my_buffer_struct).buf_len);
warn!("Data offset: {}", (*my_buffer_struct).data_off);
let buf_addr: *mut c_void = (*my_buffer_struct).buf_addr;
let real_buf_addr = buf_addr.offset((*my_buffer_struct).data_off as isize);
warn!("Address of buf_addr: {:?}", buf_addr);
warn!("Address of buf_addr + data_off: {:?}", real_buf_addr);
warn!("\n");
};
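For comparison, a rough C sketch of those steps on the secondary side; the pool and ring names ("MBUF_POOL", "SEC_TO_PRI") and the vec_bytes/vec_len variables are placeholders, not names taken from your application:
#include <rte_mempool.h>
#include <rte_mbuf.h>
#include <rte_ring.h>
#include <rte_memcpy.h>
/* look up the objects the primary created (placeholder names) */
struct rte_mempool *pool = rte_mempool_lookup("MBUF_POOL");
struct rte_ring *tx_ring = rte_ring_lookup("SEC_TO_PRI");
struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
if (m != NULL) {
    uint8_t *dst = rte_pktmbuf_mtod(m, uint8_t *);
    rte_memcpy(dst, vec_bytes, vec_len); /* copy the raw L2 frame */
    m->data_len = vec_len; /* bytes in this segment */
    m->pkt_len = vec_len;  /* total frame length */
    m->nb_segs = 1;        /* single segment */
    m->next = NULL;
    if (rte_ring_enqueue(tx_ring, m) < 0)
        rte_pktmbuf_free(m); /* ring full, do not leak the mbuf */
}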

STM32H7 SPI DMA transfer, always in busy transfer state, HAL_SPI_STATE_BUSY_TX

I am trying to transfer data over SPI using DMA, but my HAL status is HAL_SPI_STATE_BUSY_TX. The required status is HAL_SPI_STATE_READY. I want to send some bulk data and a command (single byte) through SPI. Is it possible to switch between DMA and non-DMA mode respectively? As shown in the image, it loops in the while.
hdma1_tx.Instance = DMA1_Stream7;
hdma1_tx.Init.FIFOMode = DMA_FIFOMODE_DISABLE;
hdma1_tx.Init.FIFOThreshold = DMA_FIFO_THRESHOLD_FULL;
hdma1_tx.Init.MemBurst = DMA_MBURST_INC4;
hdma1_tx.Init.PeriphBurst = DMA_PBURST_INC4;
hdma1_tx.Init.Request = DMA_REQUEST_SPI1_TX;
hdma1_tx.Init.Direction = DMA_MEMORY_TO_PERIPH;
hdma1_tx.Init.PeriphInc = DMA_PINC_DISABLE;
hdma1_tx.Init.MemInc = DMA_MINC_ENABLE;
hdma1_tx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
hdma1_tx.Init.MemDataAlignment = DMA_MDATAALIGN_BYTE;
hdma1_tx.Init.Mode = DMA_NORMAL;
hdma1_tx.Init.Priority = DMA_PRIORITY_LOW;
if(HAL_DMA_Init(&hdma1_tx) != HAL_OK)
{
// Error
}
/* Associate the initialized DMA handle to the SPI handle */
__HAL_LINKDMA(hspi, hdmatx, hdma1_tx);
/* Configure the DMA handler for Transmission process */
hdma_rx.Instance = DMA1_Stream1;
hdma_rx.Init.FIFOMode = DMA_FIFOMODE_DISABLE;
hdma_rx.Init.FIFOThreshold = DMA_FIFO_THRESHOLD_FULL;
hdma_rx.Init.MemBurst = DMA_MBURST_INC4;
hdma_rx.Init.PeriphBurst = DMA_PBURST_INC4;
hdma_rx.Init.Request = DMA_REQUEST_SPI1_RX;
hdma_rx.Init.Direction = DMA_PERIPH_TO_MEMORY;
hdma_rx.Init.PeriphInc = DMA_PINC_DISABLE;
hdma_rx.Init.MemInc = DMA_MINC_ENABLE;
hdma_rx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
hdma_rx.Init.MemDataAlignment = DMA_MDATAALIGN_BYTE;
hdma_rx.Init.Mode = DMA_NORMAL;
hdma_rx.Init.Priority = DMA_PRIORITY_HIGH;
HAL_DMA_Init(&hdma_rx);
/* Associate the initialized DMA handle to the SPI handle */
__HAL_LINKDMA(hspi, hdmarx, hdma_rx);
/*##-4- Configure the NVIC for DMA #########################################*/
/* NVIC configuration for DMA transfer complete interrupt (SPI1_TX) */
HAL_NVIC_SetPriority(DMA1_Stream7_IRQn, 1, 1);
HAL_NVIC_EnableIRQ(DMA1_Stream7_IRQn);
HAL_NVIC_SetPriority(SPI1_IRQn, 1, 0);
HAL_NVIC_EnableIRQ(SPI1_IRQn);
The HAL status must be HAL_SPI_STATE_READY.
My data length is loaded into NDTR.
After SPI is enabled, NDTR = 0x00.
I changed the TX DMA stream from DMA1_Stream7 to DMA1_Stream2 and the problem got solved. I do not know the root cause of why it was not working on Stream7.
One reason which could cause such a problem is that the variable which stores the data to be sent is placed in the wrong RAM region; review your map file and modify the linker script.
You can find more here:
https://community.st.com/s/article/FAQ-DMA-is-not-working-on-STM32H7-devices
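As an illustration of that suggestion: on the STM32H7, DMA1/DMA2 cannot access DTCM RAM, which is where many default linker scripts place variables, so one fix is to force the TX buffer into D2 SRAM. The section name ".dma_buffer" below is only an example and needs a matching output section in your .ld file; hspi1 is assumed to be your SPI handle:
/* place the SPI TX buffer in a RAM region DMA1 can reach, e.g. D2 SRAM at 0x30000000 */
__attribute__((section(".dma_buffer"))) static uint8_t spi_tx_buf[64];
/* then start the transfer as usual */
HAL_SPI_Transmit_DMA(&hspi1, spi_tx_buf, sizeof(spi_tx_buf));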

Why doesn't my program send ARP requests (C++)?

I am learning low-level sockets with C++. I have written a simple program that is supposed to send an ARP request. The socket seems to send the packet but I cannot catch it with Wireshark. I have another small program that also sends ARP packets, and those packets are captured by Wireshark (my program below is inspired by that program).
Have I done something wrong?
Removed code
EDIT
Removed code
EDIT 2
It seems that I also need to include Ethernet header data in the packet, so I now make a packet containing Ethernet header and ARP header data. Now the packet goes out and is captured by Wireshark. But Wireshark says it is gratuitous. As you can see, neither the IP nor the MAC addresses of sender and receiver seem to have been set properly.
36 13.318179 Cimsys_33:44:55 Broadcast ARP 42 Gratuitous ARP for <No address> (Request)
EDIT 3
/*Fill arp header data*/
p.arp.ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
p.arp.ea_hdr.ar_pro = htons(ETH_P_IP);
p.arp.ea_hdr.ar_hln = ETH_ALEN; // Must be pure INTEGER, not called with htons(), as I did
p.arp.ea_hdr.ar_pln = 4; // Must be pure INTEGER, not called with htons(), as I did
p.arp.ea_hdr.ar_op = htons(ETH_P_ARP);
This code does not look quite right:
struct in_addr *s_in_addr = (in_addr*)malloc(sizeof(struct in_addr));
struct in_addr *t_in_addr = (in_addr*)malloc(sizeof(struct in_addr));
s_in_addr->s_addr = inet_addr("192.168.1.5"); // source ip
t_in_addr->s_addr = inet_addr("192.168.1.6"); // target ip
memcpy(arp->arp_spa, &s_in_addr, 6);
memcpy(arp->arp_tpa, &t_in_addr, 6);
In the memcpy you are copying 6 bytes out. However, you are taking the address of a pointer variable, which gives you a pointer to a pointer. I think you meant to just pass in s_in_addr and t_in_addr.
Edit: Alan Curry notes that you are copying 6 bytes from and to objects that are only 4 bytes long.
However, it doesn't seem like the dynamic allocation is doing your code any good; you should just create the s_in_addr and t_in_addr variables on the stack. Then, you would not need to change your memcpy code.
struct in_addr s_in_addr;
struct in_addr t_in_addr;
s_in_addr.s_addr = inet_addr("192.168.1.5"); // source ip
t_in_addr.s_addr = inet_addr("192.168.1.6"); // target ip
memcpy(arp->arp_spa, &s_in_addr, sizeof(arp->arp_spa));
memcpy(arp->arp_tpa, &t_in_addr, sizeof(arp->arp_tpa));
There is a similar problem with your arp packet itself. So you should allocate it off the stack. To prevent myself from making a lot of code changes, I'll illustrate it slightly differently:
struct ether_arp arp_packet;
struct ether_arp *arp = &arp_packet;
//...
for(int i = 0; i < 10; i++) {
if (sendto(sock, arp, sizeof(arp_packet), 0,
(struct sockaddr *)&sending_socket,
sizeof(sending_socket)) < 0) {
std::cout << "Could not send!" << std::endl;
}
}
@user315052 says that you should use memcpy(arp->arp_spa, &s_in_addr, sizeof(arp->arp_spa)); but that code just copies the first 4 bytes of s_in_addr to arp->arp_spa, which absolutely does nothing!
So just try this:
*(int32_t *) arp->arp_spa = inet_addr("192.168.1.1");
*(int32_t *) arp->arp_tpa = inet_addr("192.168.1.2");

Why do I not see MSG_EOR for SOCK_SEQPACKET on linux?

I have two processes which are communicating over a pair of sockets created with socketpair() and SOCK_SEQPACKET. Like this:
int ipc_sockets[2];
socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, ipc_sockets);
As I understand it, I should see MSG_EOR in the msg_flags member of "struct msghdr" when receiving a SOCK_SEQPACKET record. I am setting MSG_EOR in sendmsg() to be certain that the record is marked MSG_EOR, but I do not see it when receiving in recvmsg(). I've even tried to set MSG_EOR in the msg_flags field before sending the record, but that made no difference at all.
I think I should see MSG_EOR unless the record was cut short by, e.g. a signal, but I do not. Why is that?
I've pasted my sending and receiving code in below.
Thanks,
jules
int
send_fd(int fd,
void *data,
const uint32_t len,
int fd_to_send,
uint32_t * const bytes_sent)
{
ssize_t n;
struct msghdr msg;
struct iovec iov;
memset(&msg, 0, sizeof(struct msghdr));
memset(&iov, 0, sizeof(struct iovec));
#ifdef HAVE_MSGHDR_MSG_CONTROL
union {
struct cmsghdr cm;
char control[CMSG_SPACE_SIZEOF_INT];
} control_un;
struct cmsghdr *cmptr;
msg.msg_control = control_un.control;
msg.msg_controllen = sizeof(control_un.control);
memset(msg.msg_control, 0, sizeof(control_un.control));
cmptr = CMSG_FIRSTHDR(&msg);
cmptr->cmsg_len = CMSG_LEN(sizeof(int));
cmptr->cmsg_level = SOL_SOCKET;
cmptr->cmsg_type = SCM_RIGHTS;
*((int *) CMSG_DATA(cmptr)) = fd_to_send;
#else
msg.msg_accrights = (caddr_t) &fd_to_send;
msg.msg_accrightslen = sizeof(int);
#endif
msg.msg_name = NULL;
msg.msg_namelen = 0;
iov.iov_base = data;
iov.iov_len = len;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
#ifdef __linux__
msg.msg_flags = MSG_EOR;
n = sendmsg(fd, &msg, MSG_EOR);
#elif defined __APPLE__
n = sendmsg(fd, &msg, 0); /* MSG_EOR is not supported on Mac
* OS X due to lack of
* SOCK_SEQPACKET support on
* socketpair() */
#endif
switch (n) {
case EMSGSIZE:
return EMSGSIZE;
case -1:
return 1;
default:
*bytes_sent = n;
}
return 0;
}
int
recv_fd(int fd,
void *buf,
const uint32_t len,
int *recvfd,
uint32_t * const bytes_recv)
{
struct msghdr msg;
struct iovec iov;
ssize_t n = 0;
#ifndef HAVE_MSGHDR_MSG_CONTROL
int newfd;
#endif
memset(&msg, 0, sizeof(struct msghdr));
memset(&iov, 0, sizeof(struct iovec));
#ifdef HAVE_MSGHDR_MSG_CONTROL
union {
struct cmsghdr cm;
char control[CMSG_SPACE_SIZEOF_INT];
} control_un;
struct cmsghdr *cmptr;
msg.msg_control = control_un.control;
msg.msg_controllen = sizeof(control_un.control);
memset(msg.msg_control, 0, sizeof(control_un.control));
#else
msg.msg_accrights = (caddr_t) &newfd;
msg.msg_accrightslen = sizeof(int);
#endif
msg.msg_name = NULL;
msg.msg_namelen = 0;
iov.iov_base = buf;
iov.iov_len = len;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
if (recvfd)
*recvfd = -1;
n = recvmsg(fd, &msg, 0);
if (msg.msg_flags) { // <== I should see MSG_EOR here if the entire record was received
return 1;
}
if (bytes_recv)
*bytes_recv = n;
switch (n) {
case 0:
*bytes_recv = 0;
return 0;
case -1:
return 1;
default:
break;
}
#ifdef HAVE_MSGHDR_MSG_CONTROL
if ((NULL != (cmptr = CMSG_FIRSTHDR(&msg)))
&& cmptr->cmsg_len == CMSG_LEN(sizeof(int))) {
if (SOL_SOCKET != cmptr->cmsg_level) {
return 0;
}
if (SCM_RIGHTS != cmptr->cmsg_type) {
return 0;
}
if (recvfd)
*recvfd = *((int *) CMSG_DATA(cmptr));
}
#else
if (recvfd && (sizeof(int) == msg.msg_accrightslen))
*recvfd = newfd;
#endif
return 0;
}
With SOCK_SEQPACKET unix domain sockets the only way for the message to be cut short is if the buffer you give to recvmsg() isn't big enough (and in that case you'll get MSG_TRUNC).
POSIX says that SOCK_SEQPACKET sockets must set MSG_EOR at the end of a record, but Linux unix domain sockets don't.
(Refs: POSIX 2008 2.10.10 says SOCK_SEQPACKET must support records, and 2.10.6 says record boundaries are visible to the receiver via the MSG_EOR flag.)
What a 'record' means for a given protocol is up to the implementation to define.
If Linux did implement MSG_EOR for unix domain sockets, I think the only sensible way would be to say that each packet was a record in itself, and so always set MSG_EOR (or maybe always set it when not setting MSG_TRUNC), so it wouldn't be informative anyway.
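So on Linux AF_UNIX the practical thing to test for is truncation rather than end-of-record. A minimal sketch, assuming fd is the connected SOCK_SEQPACKET socket:
#include <sys/socket.h>
#include <sys/uio.h>
char buf[4096];
struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
ssize_t n = recvmsg(fd, &msg, 0);
if (n >= 0 && (msg.msg_flags & MSG_TRUNC)) {
    /* the record was larger than buf; the excess bytes were discarded */
}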
That's not what MSG_EOR is for.
Remember that the sockets API is an abstraction over a number of different protocols, including UNIX filesystem sockets, socketpairs, TCP, UDP, and many many different network protocols, including X.25 and some entirely forgotten ones.
MSG_EOR is to signal end of record where that makes sense for the underlying protocol. I.e. it is to pass a message to the next layer down that "this completes a record". This may affect for example, buffering, causing the flushing of a buffer. But if the protocol itself doesn't have a concept of a "record" there is no reason to expect the flag to be propagated.
Secondly, if using SEQPACKET you must read the entire message at once. If you do not, the remainder will be discarded. That's documented. In particular, MSG_EOR is not a flag to tell you that this is the last part of the packet.
Advice: You are obviously writing a non-SEQPACKET version for use on MacOS. I suggest you dump the SEQPACKET version as it is only going to double the maintenance and coding burden. SOCK_STREAM is fine for all platforms.
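If you do drop to SOCK_STREAM, record boundaries then have to be carried in-band; the usual approach is a length prefix, sketched below (a blocking socket is assumed, and partial writes are simply treated as errors):
#include <arpa/inet.h>
#include <stdint.h>
#include <unistd.h>
/* write one record as a 4-byte network-order length followed by the payload */
static int send_record(int fd, const void *data, uint32_t len)
{
    uint32_t hdr = htonl(len);
    if (write(fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr))
        return -1;
    if (write(fd, data, len) != (ssize_t)len)
        return -1;
    return 0;
}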
When you read the docs, SOCK_SEQPACKET differs from SOCK_STREAM in two distinct ways. Firstly -
Sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
-- socket(2) from Linux manpages project
aka
For message-based sockets, such as SOCK_DGRAM and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffers, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded, and MSG_TRUNC shall be set in the msg_flags member of the msghdr structure.
-- recvmsg() in POSIX standard.
In this sense it is similar to SOCK_DGRAM.
Secondly each "datagram" (Linux) / "message" (POSIX) carries a flag called MSG_EOR.
However Linux SOCK_SEQPACKET for AF_UNIX does not implement MSG_EOR. The current docs do not match reality :-)
Allegedly some SOCK_SEQPACKET implementations do the other one. And some implement both. So that covers all the possible different combinations :-)
[1] Packet oriented protocols generally use packet level reads with
truncation / discard semantics and no MSG_EOR. X.25, Bluetooth, IRDA,
and Unix domain sockets use SOCK_SEQPACKET this way.
[2] Record oriented protocols generally use byte stream reads and MSG_EOR - no packet level visibility, no truncation / discard. DECNet and ISO TP use SOCK_SEQPACKET that way.
[3] Packet / record hybrids generally use SOCK_SEQPACKET with truncation /
discard semantics on the packet level, and record terminating packets
marked with MSG_EOR. SPX and XNS SPP use SOCK_SEQPACKET this way.
https://mailarchive.ietf.org/arch/msg/tsvwg/9pDzBOG1KQDzQ2wAul5vnAjrRkA
You've shown an example of paragraph 1.
Paragraph 2 also applies to SOCK_SEQPACKET as defined for SCTP. Although by default it sets MSG_EOR on every sendmsg(). The option to disable this is called SCTP_EXPLICIT_EOR.
Paragraph 3, the one most consistent with the docs, seems to be the most obscure case.
And even the docs are not properly consistent with themselves.
The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and is also connection-oriented. The only difference between these types is that record boundaries are maintained using the SOCK_SEQPACKET type. A record can be sent using one or more output operations and received using one or more input operations, but a single operation never transfers parts of more than one record. Record boundaries are visible to the receiver via the MSG_EOR flag in the received message flags returned by the recvmsg() function. -- POSIX standard