When using the function getsockopt(...) with the level SOL_SOCKET and option SO_BSP_STATE, I am receiving the WSA error code WSAEFAULT, which states the following:
"One of the optval or the optlen parameters is not a valid part of the user address space, or the optlen parameter is too small."
However, I was passing in a correctly sized, user-mode buffer:
/* ... */
HRESULT Result = E_UNEXPECTED;
CSADDR_INFO Info = { 0 }; // Placed on the stack.
int InfoSize = sizeof (CSADDR_INFO); // The size of the input buffer to `getsockopt()`.
// Get the local address information from the raw `SOCKET`.
if (getsockopt (this->WsaSocket,
SOL_SOCKET,
SO_BSP_STATE,
reinterpret_cast <char *> (&Info),
&InfoSize) == SOCKET_ERROR)
{
Result = HRESULT_FROM_WIN32 (WSAGetLastError ());
}
else
{
Result = S_OK;
}
/* ... */
According to the SO_BSP_STATE entry in the Remarks section of the getsockopt(...) documentation, the returned value is of type CSADDR_INFO. Furthermore, the Microsoft documentation page for the SO_BSP_STATE socket option states the following requirements:
optval:
"[...] This parameter should point to buffer equal to or larger than the size of a CSADDR_INFO structure."
optlen:
"[...] This size must be equal to or larger than the size of a CSADDR_INFO structure."
After doing some research, I stumbled upon some test code from WineHQ that was passing in more memory than sizeof(CSADDR_INFO) when calling getsockopt(...) (see lines 1305 and 1641):
union _csspace
{
CSADDR_INFO cs;
char space[128];
} csinfoA, csinfoB;
It also looks like the ReactOS project references this exact same code (see reference). Even though this is a union, because sizeof(CSADDR_INFO) is always less than 128, csinfoA will always be 128 bytes in size.
This got me wondering how many bytes the socket option SO_BSP_STATE actually requires when calling getsockopt(...). I created the following complete example (Visual Studio 2019 / C++17) which illustrates that SO_BSP_STATE in fact requires a buffer larger than sizeof(CSADDR_INFO), in direct contradiction to the published Microsoft documentation:
/**
* #note This example was created and compiled in Visual Studio 2019.
*/
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <iostream>
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")
/**
* #brief The number of bytes to increase the #ref CSADDR_INFO_PLUS_EXTRA_SPACE structure by.
* #note Alignment and pointer size changes when compiling for Intel x86 versus Intel x64.
* The extra bytes required therefore vary.
*/
#if defined(_M_X64) || defined(__amd64__)
#define EXTRA_SPACE (25u) // Required extra space when compiling for X64
#else
#define EXTRA_SPACE (29u) // Required extra space when compiling for X86
#endif
/**
* #brief A structure to add extra space passed the `CSADDR_INFO` structure.
*/
typedef struct _CSADDR_INFO_PLUS_EXTRA_SPACE
{
/**
* #brief The standard structure to store Windows Sockets address information.
*/
CSADDR_INFO Info;
/**
* #brief A blob of extra space.
*/
char Extra [EXTRA_SPACE];
} CSADDR_INFO_PLUS_EXTRA_SPACE;
/**
* #brief The main entry function for this console application for demonstrating an issue with `SO_BSP_STATE`.
*/
int main (void)
{
HRESULT Result = S_OK; // The execution result of this console application.
SOCKET RawSocket = { 0 }; // The raw WSA socket handle that references the socket's memory.
WSADATA WindowsSocketsApiDetails = { 0 }; // The WSA implementation details about the current WSA DLL.
CSADDR_INFO_PLUS_EXTRA_SPACE Info = { 0 }; // The structure `CSADDR_INFO` plus an extra blob of memory.
int InfoSize = sizeof (CSADDR_INFO_PLUS_EXTRA_SPACE);
std::cout << "Main Entry!" << std::endl;
// Request for the latest Windows Sockets API (WSA) (a.k.a. Winsock) DLL available on this system.
if (WSAStartup (MAKEWORD(2,2),
&WindowsSocketsApiDetails) != 0)
{
Result = HRESULT_FROM_WIN32 (WSAGetLastError ());
}
// Create a blank TCP socket using IPv4.
if ((RawSocket = WSASocketW (AF_INET,
SOCK_STREAM,
IPPROTO_TCP,
nullptr,
0,
0)) == INVALID_SOCKET)
{
Result = HRESULT_FROM_WIN32 (WSAGetLastError ());
}
else
{
// Get the local address information from the raw `SOCKET`.
if (getsockopt (RawSocket,
SOL_SOCKET,
SO_BSP_STATE,
reinterpret_cast <char *> (&Info),
&InfoSize) == SOCKET_ERROR)
{
std::cout << "Failed obtained the socket's state information!" << std::endl;
Result = HRESULT_FROM_WIN32 (WSAGetLastError ());
}
else
{
std::cout << "Successfully obtained the socket's state information!" << std::endl;
Result = S_OK;
}
}
// Clean up the entire Windows Sockets API (WSA) environment and release the DLL resource.
if (WSACleanup () != 0)
{
Result = HRESULT_FROM_WIN32 (WSAGetLastError ());
}
std::cout << "Exit Code: 0x" << std::hex << Result << std::endl;
return Result;
}
(If you change the EXTRA_SPACE define to equal 0 or 1, then you will see the issue I am outlining.)
Because the default structure alignment and pointer size change when compiling for X86 versus X64 in Visual Studio 2019, the required extra space beyond the CSADDR_INFO structure varies:
Space required for X86: sizeof(CSADDR_INFO) + 29
Space required for X64: sizeof(CSADDR_INFO) + 25
As shown, this padding is completely arbitrary, and if you don't add it, getsockopt(...) fails. That makes me question whether the data I am getting back is even correct. It looks like there might be a missing footnote in the published documentation; however, I could very well be misunderstanding something (quite likely, in fact).
My Question(s):
What determines the buffer size that SO_BSP_STATE actually requires (i.e., is it tied to some other structure)? It is clearly not sizeof(CSADDR_INFO) as documented.
Is the Microsoft documentation incorrect (reference)? If not, what is wrong with my code example above that prevents getsockopt(...) from succeeding when EXTRA_SPACE is set to 0?
I think what's happening here is as follows:
CSADDR_INFO is defined like so:
typedef struct _CSADDR_INFO {
SOCKET_ADDRESS LocalAddr;
SOCKET_ADDRESS RemoteAddr;
INT iSocketType;
INT iProtocol;
} CSADDR_INFO;
Specifically, it contains two SOCKET_ADDRESS structures.
SOCKET_ADDRESS is defined like so:
typedef struct _SOCKET_ADDRESS {
LPSOCKADDR lpSockaddr;
INT iSockaddrLength;
} SOCKET_ADDRESS;
The lpSockaddr member of the SOCKET_ADDRESS structure is a pointer to a sockaddr structure, and the length of that structure varies by address family (IPv4 vs IPv6, for example).
It follows that getsockopt needs somewhere to store those sockaddr structures, and that's where your 'blob' of extra data comes in - they're in there, pointed to by the two SOCKET_ADDRESS structures. It further follows that the worst case for the size of this extra data may be more than you are allowing, since IPv6 addresses are longer than IPv4 addresses.
Of course, the documentation should spell all of this out, but, as is sometimes the case, the authors probably didn't understand how things work. You might like to raise a bug report.
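As a rough illustration of that (a sketch only, not something the documentation guarantees; it reuses the RawSocket from the example above), you could reserve worst-case room for both addresses by backing the CSADDR_INFO with two SOCKADDR_STORAGE blocks, SOCKADDR_STORAGE being the largest sockaddr variant:
typedef struct _CSADDR_INFO_WITH_STORAGE
{
    CSADDR_INFO      Info;       // LocalAddr / RemoteAddr should end up pointing into Storage.
    SOCKADDR_STORAGE Storage[2]; // Worst-case space for the local and remote addresses.
} CSADDR_INFO_WITH_STORAGE;

CSADDR_INFO_WITH_STORAGE Buffer = { 0 };
int BufferSize = sizeof (Buffer);

if (getsockopt (RawSocket,
                SOL_SOCKET,
                SO_BSP_STATE,
                reinterpret_cast <char *> (&Buffer),
                &BufferSize) != SOCKET_ERROR)
{
    // On success, Buffer.Info.LocalAddr.lpSockaddr (when non-null) should point
    // somewhere inside Buffer rather than at separately allocated memory.
}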
Related
Running in Docker on macOS, I have a simple server and client setup to measure how fast I can allocate data on the client and send it to the server. The tests are done over loopback (within the same Docker container). The message size for my tests was 1000000 bytes.
When I set SO_RCVBUF and SO_SNDBUF to their respective defaults, the performance halves.
SO_RCVBUF defaults to 65536 and SO_SNDBUF defaults to 1313280 (retrieved by calling getsockopt and dividing by 2).
Tests:
When I test setting neither buffer size, I get about 7 Gb/s throughput.
When I set one buffer or the other to the default (or higher) I get 3.5 Gb/s.
When I set both buffer sizes to the default I get 2.5 Gb/s.
Server code: (cs is an accepted stream socket)
void tcp_rr(int cs, uint64_t& processed) {
/* I remove this entire thing and performance improves */
if (setsockopt(cs, SOL_SOCKET, SO_RCVBUF, &ENV.recv_buf, sizeof(ENV.recv_buf)) == -1) {
perror("RCVBUF failure");
return;
}
char *buf = (char *)malloc(ENV.msg_size);
while (true) {
int recved = 0;
while (recved < ENV.msg_size) {
int recvret = recv(cs, buf + recved, ENV.msg_size - recved, 0);
if (recvret <= 0) {
if (recvret < 0) {
perror("Recv error");
}
return;
}
processed += recvret;
recved += recvret;
}
}
free(buf);
}
Client code: (s is a connected stream socket)
void tcp_rr(int s, uint64_t& processed, BenchStats& stats) {
/* I remove this entire thing and performance improves */
if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &ENV.send_buf, sizeof(ENV.send_buf)) == -1) {
perror("SNDBUF failure");
return;
}
char *buf = (char *)malloc(ENV.msg_size);
while (stats.elapsed_millis() < TEST_TIME_MILLIS) {
int sent = 0;
while (sent < ENV.msg_size) {
int sendret = send(s, buf + sent, ENV.msg_size - sent, 0);
if (sendret <= 0) {
if (sendret < 0) {
perror("Send error");
}
return;
}
processed += sendret;
sent += sendret;
}
}
free(buf);
}
Zeroing in on SO_SNDBUF:
The default appears to be: net.ipv4.tcp_wmem = 4096 16384 4194304
If I setsockopt to 4194304 and getsockopt (to see what's currently set) it returns 425984 (10x less than I requested).
Additionally, it appears that calling setsockopt sets a lock on buffer expansion (for send, the lock is named SOCK_SNDBUF_LOCK, and it prohibits adaptive expansion of the buffer). The question then is: why can't I request the full-size buffer?
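For reference, the observation above boils down to this (a trimmed sketch, assuming a connected TCP socket s and the usual socket headers):
int requested = 4194304;
if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) == -1) {
    perror("SNDBUF failure");
}

int actual = 0;
socklen_t len = sizeof(actual);
if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &actual, &len) == -1) {
    perror("getsockopt failure");
}

// On this system this printed: requested 4194304, got back 425984.
printf("requested %d, got back %d\n", requested, actual);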
Clues for what is going on come from the kernel source handle for SO_SNDBUF (and SO_RCVBUF but we'll focus on SO_SNDBUF below).
net/core/sock.c contains implementations for the generic socket operations, including the SOL_SOCKET getsockopt and setsockopt handles.
Examining what happens when we call setsockopt(s, SOL_SOCKET, SO_SNDBUF, ...):
case SO_SNDBUF:
/* Don't error on this BSD doesn't and if you think
* about it this is right. Otherwise apps have to
* play 'guess the biggest size' games. RCVBUF/SNDBUF
* are treated in BSD as hints
*/
val = min_t(u32, val, sysctl_wmem_max);
set_sndbuf:
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
/* Wake up sending tasks if we upped the value. */
sk->sk_write_space(sk);
break;
case SO_SNDBUFFORCE:
if (!capable(CAP_NET_ADMIN)) {
ret = -EPERM;
break;
}
goto set_sndbuf;
Some interesting things pop out.
First of all, we see that the maximum possible value is sysctl_wmem_max, a setting which is difficult to pin down inside a Docker container. We know from the context above that this is likely 212992 (half the value you got back after trying to set 4194304).
Secondly, we see SOCK_SNDBUF_LOCK being set. This setting is in my opinion not well documented in the man pages, but it appears to lock dynamic adjustment of the buffer size.
For example, in the function tcp_should_expand_sndbuf we get:
static bool tcp_should_expand_sndbuf(const struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
/* If the user specified a specific send buffer setting, do
* not modify it.
*/
if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
return false;
...
So what is happening in your code? You attempt to set the max value as you understand it, but it's being truncated to something 10x smaller by the sysctl sysctl_wmem_max. This is then made far worse by the fact that setting this option now locks the buffer to that smaller size. The strange part is that apparently dynamically resizing the buffer doesn't have this same restriction, but can go all the way to the max.
If you look at the first code snip above, you see the SO_SNDBUFFORCE option. This will disregard the sysctl_wmem_max and allow you to set essentially any buffer size provided you have the right permissions.
It turns out processes launched in generic docker containers don't have CAP_NET_ADMIN, so in order to use this socket option, you must run in --privileged mode. However, if you do, and if you force the max size, you will see your benchmark return the same throughput as not setting the option at all and allowing it to grow dynamically to the same size.
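For completeness, a sketch of the forced variant (assuming a Linux TCP socket s; it needs CAP_NET_ADMIN, so inside Docker that means --privileged or an explicitly added capability):
int size = 4194304;
if (setsockopt(s, SOL_SOCKET, SO_SNDBUFFORCE, &size, sizeof(size)) == -1) {
    perror("SNDBUFFORCE failure");   // EPERM when CAP_NET_ADMIN is missing
}
// Note from the kernel snippet above: this path also sets SOCK_SNDBUF_LOCK,
// so automatic tuning of the buffer stays disabled.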
I'm having an issue where I send a message to user-mode from kernel-mode using FltSendMessage that expects a reply. The struct being passed contains an int that is set to either 0 or 1. User-mode replies by setting this flag and calling FilterReplyMessage. However, when the message is received by the kernel, its value is always 56. No matter what number I set the flag to in user-mode, the kernel always receives the value 56. I'm confused as to where my error is.
I've tried changing the data type of passFlag from int to other types (USHORT etc..) which I knew probably wouldn't make a difference, but was worth a try.
Because the kernel message is replied to successfully (checking the user-mode HRESULT shows no errors, and since there is no timeout the system would hang if no reply were received, which it does not), I know the error must be in the buffers being passed between user mode and kernel mode. I can't seem to find the reason why passFlag is not being interpreted correctly in kernel mode.
Can anyone help?
Shared Structure:
typedef struct _REPLY_MESSAGE_STRUCT {
// Message header.
FILTER_REPLY_HEADER header;
// Flag to be set
// by user mode.
int passFlag;
}REPLY_MESSAGE_STRUCT, *PREPLY_MESSAGE_STRUCT;
Kernel Code:
DbgPrint("Sending Message...\n");
REPLY_MESSAGE_STRUCT replyBuffer;
replyBuffer.passFlag = 0;
ULONG replySize = ((ULONG)sizeof(replyBuffer.header)) + ((ULONG)sizeof(replyBuffer));
// Note: Try-catch statement has been omitted in this question to save time.
// In the actual code there is a try-catch statement surrounding FltSendMessage.
status = FltSendMessage(imageFilterData.filterHandle,
&imageFilterData.clientPort,
(PVOID)&sendingBuffer.messageBuffer,
sizeof(sendingBuffer.messageBuffer),
(PVOID)&replyBuffer,
&replySize,
0
);
// Always returns 56
// When a reply has been received.
DbgPrint("Message received: %i\n", replyBuffer.passFlag);
User code:
// User-mode buffer is the same size as kernel-mode buffer.
REPLY_MESSAGE_STRUCT replyMessage;
ULONG replySize = ((ULONG)sizeof(replyMessage.header)) + ((ULONG)sizeof(replyMessage));
replyMessage.header.MessageId = messageFromKernel.header.MessageId;
// User-mode setting flag.
replyMessage.passFlag = 1;
// Flag is changed to 1 successfully.
printf("Test: %i\n", replyMessage.passFlag);
// Reply is sent successfully, but flag value on kernel end is always 56
hResult = FilterReplyMessage(port,
&replyMessage.header,
replySize);
_com_error err2(hResult);
errorMessage = err2.ErrorMessage();
// No errors.
printf("Result: %s\n", errorMessage);
What I have tried:
Changing the datatype of passFlag.
Going through every step before and after FltSendMessage and FilterReplyMessage to see whether the value is being changed before being sent back to the kernel.
You are passing the wrong buffers in the call to FltSendMessage:
ReplyBuffer is a pointer to custom, user-defined data; it must not begin with a FILTER_REPLY_HEADER.
SenderBuffer is a pointer to custom, user-defined data; it must not begin with a FILTER_MESSAGE_HEADER.
First of all, you need to define the structures that are shared between kernel and user mode for the message and the reply. For example:
struct SCANNER_NOTIFICATION {
// any custom data
int someData;
};
struct SCANNER_REPLY {
// any custom data
int passFlag;
};
In kernel mode you use them directly, as is:
SCANNER_NOTIFICATION send;
SCANNER_REPLY reply;
ULONG ReplyLength = sizeof(reply);
FltSendMessage(*, *, &send, sizeof(send), &reply, &ReplyLength, *);
In user mode you need to define two additional structures:
struct SCANNER_MESSAGE : public FILTER_MESSAGE_HEADER, public SCANNER_NOTIFICATION {};
struct SCANNER_REPLY_MESSAGE : public FILTER_REPLY_HEADER, public SCANNER_REPLY {};
(I use C++ style here, whereas the question uses C style.)
In user mode you then use these, for example:
SCANNER_MESSAGE message;
FilterGetMessage(*, &message, sizeof(message), *);
and
SCANNER_REPLY_MESSAGE reply;
reply.MessageId = message.MessageId;
FilterReplyMessage(*, &reply, sizeof(reply));
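To spell out the size arithmetic this implies (a sketch only, following the SCANNER_* structures above, not verified against your driver): the ReplyLength handed to FltSendMessage counts just the payload, while the size passed to FilterReplyMessage includes the FILTER_REPLY_HEADER that the user-mode buffer starts with.
// Kernel mode: the reply length excludes FILTER_REPLY_HEADER.
SCANNER_REPLY reply;
ULONG ReplyLength = sizeof(reply);               // payload only

// User mode: the buffer begins with FILTER_REPLY_HEADER and the size includes it.
SCANNER_REPLY_MESSAGE replyMessage;
replyMessage.Status    = 0;                      // FILTER_REPLY_HEADER::Status
replyMessage.MessageId = message.MessageId;      // echo the id received via FilterGetMessage
replyMessage.passFlag  = 1;                      // the payload the kernel will actually see
FilterReplyMessage(port, &replyMessage, sizeof(replyMessage));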
I have the given code:
#include <winsock2.h>
#include <sys/time.h>
#include <iostream>
int main()
{
WSADATA wsaData;
if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
{
std::cout << "WSA Initialization failed!" << std::endl;
WSACleanup();
}
timeval time;
time.tv_sec = 1;
time.tv_usec = 0;
int retval = select(0, NULL, NULL, NULL, &time);
if (retval == SOCKET_ERROR)
{
std::cout << WSAGetLastError() << std::endl;
}
return 0;
}
It prints 10022, which means error WSAEINVAL. According to this page, I can get this error only if:
WSAEINVAL: The time-out value is not valid, or all three descriptor parameters were null.
However, I have seen a few examples calling select() without any FD_SETs. Is it possible somehow? I need to do it in a client-side code to let the program sleep for short periods while it is not connected to the server.
However, I have seen a few examples calling select() without any
FD_SETs.
It will work on most OSes (just not Windows).
Is it possible somehow [under Windows]?
Not directly, but it's easy enough to roll your own wrapper around select() that gives you the behavior you want even under Windows:
int proper_select(int largestFileDescriptorValuePlusOne, fd_set * readFS, fd_set * writeFS, fd_set * exceptFS, struct timeval * timeout)
{
#ifdef _WIN32
// Note that you *do* need to pass in the correct value
// for (largestFileDescriptorValuePlusOne) for this wrapper
// to work; Windows programmers sometimes just pass in a dummy value,
// because the current Windows implementation of select() ignores the
// parameter, but that's a portability-killing hack and wrong,
// so don't do it!
if ((largestFileDescriptorValuePlusOne <= 0)&&(timeout != NULL))
{
// Windows select() will error out on a timeout-only call, so call Sleep() instead.
Sleep(((timeout->tv_sec*1000000)+timeout->tv_usec)/1000);
return 0;
}
#endif
// in all other cases we just pass through to the normal select() call
return select(largestFileDescriptorValuePlusOne, readFS, writeFS, exceptFS, timeout);
}
... then just call proper_select() instead of select() and you're golden.
From the notorious and offensive Winsock 'lame list':
Calling select() with three empty FD_SETs and a valid TIMEOUT structure as a sleezy delay function.
Inexcusably lame.
Note the mis-spelling. The document is worth reading, if you can stand it, just to see the incredible depths hubris can attain. In case they've recanted, or discovered that they didn't invent the Sockets API, you could try it with empty FD sets instead of null parameters, but I don't hold out much hope.
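If you want to try that, the test is short (a sketch; assumes WSAStartup has already succeeded, as in the question):
fd_set readSet, writeSet, exceptSet;
FD_ZERO(&readSet);
FD_ZERO(&writeSet);
FD_ZERO(&exceptSet);

timeval time;
time.tv_sec = 1;
time.tv_usec = 0;

int retval = select(0, &readSet, &writeSet, &exceptSet, &time);
if (retval == SOCKET_ERROR)
{
    // If this still prints 10022 (WSAEINVAL), fall back to the Sleep() wrapper above.
    std::cout << WSAGetLastError() << std::endl;
}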
I'm testing the function setsockopt() with the code below, and I'm getting a behavior I don't understand:
Below is the code snippet I'm running (compiled on Ubuntu 12.04 64bit, Qt 4.8.x):
#include <QCoreApplication>
#include <sys/types.h> /* See NOTES */
#include <sys/socket.h>
#include <QDebug>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
int sock = ::socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
int res;
int bufferSizeByte = QString(argv[1]).toInt();
qDebug() << "Setting socket buffer size to" << bufferSizeByte << "bytes";
res = setsockopt( sock, SOL_SOCKET, SO_RCVBUF, (void*)&bufferSizeByte, sizeof(bufferSizeByte) );
if ( -1 == res )
{
qDebug() << "ERROR setting socket buffer size to" << bufferSizeByte << "bytes";
}
/*
* !! WARNING !!
* If we try setting the buff size over the kernel max: we do not get an error
*/
int readValue = 0;
unsigned int readLen = sizeof(readValue);
res = getsockopt( sock, SOL_SOCKET, SO_RCVBUF, (void*)&readValue, &readLen );
if ( -1 == res )
{
qDebug() << "ERROR reading socket buffer size";
}
else
{
qDebug() << "Read socket buffer size:" << readValue << "bytes";
Q_ASSERT ( readValue == bufferSizeByte*2 );
}
return a.exec();
}
Basically I'm setting the receive buffer size for a socket, and reading it back to verify that the operation really succeeded.
Setting the buffer size to a value over the one configured in the Linux kernel (/proc/sys/net/core/rmem_max) triggers the Q_ASSERT() as expected, but I do not get the setsockopt() error message.
For instance:
sergio#netbeast: sudo ./setsockopt 300000
Setting socket buffer size to 300000 bytes
Read socket buffer size: 262142 bytes
ASSERT: "readValue == bufferSizeByte*2" in file ../setsockopt/main.cpp, line 43
What I don't get is why the setsockopt() does not return an error
Any clue?
The implementation of sock_setsockopt() (which is what the setsockopt() system call ends up calling in the kernel) has a comment explaining why setting too large a value does not produce an error. The comment indicates the reason is compatibility with the original BSD implementation (and thus, so that software written for BSD systems can be more easily ported to Linux):
/* Don't error on this BSD doesn't and if you think
* about it this is right. Otherwise apps have to
* play 'guess the biggest size' games. RCVBUF/SNDBUF
* are treated in BSD as hints
*/
Please note that what is actually stored is double the value that was passed in for SO_RCVBUF (after clamping to the maximum and subject to a minimum). From the man page:
SO_RCVBUF Sets or gets the maximum socket receive buffer in bytes. The kernel doubles this value (to allow space for bookkeeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/rmem_default file, and the maximum allowed value is set by the /proc/sys/net/core/rmem_max file. The minimum (doubled) value for this option is 256.
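Putting that together, the assertion in the question has to account for both the clamping and the doubling. A sketch (rmemMax is a hypothetical variable holding the value read from /proc/sys/net/core/rmem_max, which the output above suggests was 131071 on that machine):
// Expected stored size: the request is clamped to rmem_max and then doubled
// (the man page also documents a minimum doubled value of 256).
int clamped  = qMin(bufferSizeByte, rmemMax);
int expected = qMax(clamped * 2, 256);
Q_ASSERT(readValue == expected);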
I have two processes which are communicating over a pair of sockets created with socketpair() and SOCK_SEQPACKET. Like this:
int ipc_sockets[2];
socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, ipc_sockets);
As I understand it, I should see MSG_EOR in the msg_flags member of "struct msghdr" when receiving a SOCK_SEQPACKET record. I am setting MSG_EOR in sendmsg() to be certain that the record is marked MSG_EOR, but I do not see it when receiving in recvmsg(). I've even tried to set MSG_EOR in the msg_flags field before sending the record, but that made no difference at all.
I think I should see MSG_EOR unless the record was cut short by, e.g. a signal, but I do not. Why is that?
I've pasted my sending and receiving code in below.
Thanks,
jules
int
send_fd(int fd,
void *data,
const uint32_t len,
int fd_to_send,
uint32_t * const bytes_sent)
{
ssize_t n;
struct msghdr msg;
struct iovec iov;
memset(&msg, 0, sizeof(struct msghdr));
memset(&iov, 0, sizeof(struct iovec));
#ifdef HAVE_MSGHDR_MSG_CONTROL
union {
struct cmsghdr cm;
char control[CMSG_SPACE_SIZEOF_INT];
} control_un;
struct cmsghdr *cmptr;
msg.msg_control = control_un.control;
msg.msg_controllen = sizeof(control_un.control);
memset(msg.msg_control, 0, sizeof(control_un.control));
cmptr = CMSG_FIRSTHDR(&msg);
cmptr->cmsg_len = CMSG_LEN(sizeof(int));
cmptr->cmsg_level = SOL_SOCKET;
cmptr->cmsg_type = SCM_RIGHTS;
*((int *) CMSG_DATA(cmptr)) = fd_to_send;
#else
msg.msg_accrights = (caddr_t) &fd_to_send;
msg.msg_accrightslen = sizeof(int);
#endif
msg.msg_name = NULL;
msg.msg_namelen = 0;
iov.iov_base = data;
iov.iov_len = len;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
#ifdef __linux__
msg.msg_flags = MSG_EOR;
n = sendmsg(fd, &msg, MSG_EOR);
#elif defined __APPLE__
n = sendmsg(fd, &msg, 0); /* MSG_EOR is not supported on Mac
* OS X due to lack of
* SOCK_SEQPACKET support on
* socketpair() */
#endif
switch (n) {
case EMSGSIZE:
return EMSGSIZE;
case -1:
return 1;
default:
*bytes_sent = n;
}
return 0;
}
int
recv_fd(int fd,
void *buf,
const uint32_t len,
int *recvfd,
uint32_t * const bytes_recv)
{
struct msghdr msg;
struct iovec iov;
ssize_t n = 0;
#ifndef HAVE_MSGHDR_MSG_CONTROL
int newfd;
#endif
memset(&msg, 0, sizeof(struct msghdr));
memset(&iov, 0, sizeof(struct iovec));
#ifdef HAVE_MSGHDR_MSG_CONTROL
union {
struct cmsghdr cm;
char control[CMSG_SPACE_SIZEOF_INT];
} control_un;
struct cmsghdr *cmptr;
msg.msg_control = control_un.control;
msg.msg_controllen = sizeof(control_un.control);
memset(msg.msg_control, 0, sizeof(control_un.control));
#else
msg.msg_accrights = (caddr_t) &newfd;
msg.msg_accrightslen = sizeof(int);
#endif
msg.msg_name = NULL;
msg.msg_namelen = 0;
iov.iov_base = buf;
iov.iov_len = len;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
if (recvfd)
*recvfd = -1;
n = recvmsg(fd, &msg, 0);
if (msg.msg_flags) { // <== I should see MSG_EOR here if the entire record was received
return 1;
}
if (bytes_recv)
*bytes_recv = n;
switch (n) {
case 0:
*bytes_recv = 0;
return 0;
case -1:
return 1;
default:
break;
}
#ifdef HAVE_MSGHDR_MSG_CONTROL
if ((NULL != (cmptr = CMSG_FIRSTHDR(&msg)))
&& cmptr->cmsg_len == CMSG_LEN(sizeof(int))) {
if (SOL_SOCKET != cmptr->cmsg_level) {
return 0;
}
if (SCM_RIGHTS != cmptr->cmsg_type) {
return 0;
}
if (recvfd)
*recvfd = *((int *) CMSG_DATA(cmptr));
}
#else
if (recvfd && (sizeof(int) == msg.msg_accrightslen))
*recvfd = newfd;
#endif
return 0;
}
With SOCK_SEQPACKET unix domain sockets the only way for the message to be cut short is if the buffer you give to recvmsg() isn't big enough (and in that case you'll get MSG_TRUNC).
POSIX says that SOCK_SEQPACKET sockets must set MSG_EOR at the end of a record, but Linux unix domain sockets don't.
(Refs: POSIX 2008 2.10.10 says SOCK_SEQPACKET must support records, and 2.10.6 says record boundaries are visible to the receiver via the MSG_EOR flag.)
What a 'record' means for a given protocol is up to the implementation to define.
If Linux did implement MSG_EOR for unix domain sockets, I think the only sensible way would be to say that each packet was a record in itself, and so always set MSG_EOR (or maybe always set it when not setting MSG_TRUNC), so it wouldn't be informative anyway.
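Given that, a check in recv_fd() that treats only truncation as an error (rather than rejecting any non-zero msg_flags) would look something like this sketch:
n = recvmsg(fd, &msg, 0);
if (msg.msg_flags & MSG_TRUNC) {
    // The buffer was too small; the rest of this record has been discarded.
    return 1;
}
// Do not require MSG_EOR here: Linux AF_UNIX SOCK_SEQPACKET never sets it.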
That's not what MSG_EOR is for.
Remember that the sockets API is an abstraction over a number of different protocols, including UNIX filesystem sockets, socketpairs, TCP, UDP, and many many different network protocols, including X.25 and some entirely forgotten ones.
MSG_EOR is to signal end of record where that makes sense for the underlying protocol. I.e. it is to pass a message to the next layer down that "this completes a record". This may affect for example, buffering, causing the flushing of a buffer. But if the protocol itself doesn't have a concept of a "record" there is no reason to expect the flag to be propagated.
Secondly, if using SEQPACKET you must read the entire message at once. If you do not the remainder will be discarded. That's documented. In particular, MSG_EOR is not a flag to tell you that this is the last part of the packet.
Advice: You are obviously writing a non-SEQPACKET version for use on MacOS. I suggest you dump the SEQPACKET version as it is only going to double the maintenance and coding burden. SOCK_STREAM is fine for all platforms.
When you read the docs, SOCK_SEQPACKET differs from SOCK_STREAM in two distinct ways. Firstly -
Sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
-- socket(2) from Linux manpages project
aka
For message-based sockets, such as SOCK_DGRAM and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffers, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded, and MSG_TRUNC shall be set in the msg_flags member of the msghdr structure.
-- recvmsg() in POSIX standard.
In this sense it is similar to SOCK_DGRAM.
Secondly each "datagram" (Linux) / "message" (POSIX) carries a flag called MSG_EOR.
However Linux SOCK_SEQPACKET for AF_UNIX does not implement MSG_EOR. The current docs do not match reality :-)
Allegedly some SOCK_SEQPACKET implementations do the other one. And some implement both. So that covers all the possible different combinations :-)
[1] Packet oriented protocols generally use packet level reads with
truncation / discard semantics and no MSG_EOR. X.25, Bluetooth, IRDA,
and Unix domain sockets use SOCK_SEQPACKET this way.
[2] Record oriented protocols generally use byte stream reads and MSG_EOR -
no packet level visibility, no truncation / discard. DECNet and ISO TP use SOCK_SEQPACKET that way.
[3] Packet / record hybrids generally use SOCK_SEQPACKET with truncation /
discard semantics on the packet level, and record terminating packets
marked with MSG_EOR. SPX and XNS SPP use SOCK_SEQPACKET this way.
https://mailarchive.ietf.org/arch/msg/tsvwg/9pDzBOG1KQDzQ2wAul5vnAjrRkA
You've shown an example of paragraph 1.
Paragraph 2 also applies to SOCK_SEQPACKET as defined for SCTP, although by default it sets MSG_EOR on every sendmsg(); the option to disable this is called SCTP_EXPLICIT_EOR.
Paragraph 3, the one most consistent with the docs, seems to be the most obscure case.
And even the docs are not properly consistent with themselves.
The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and is also connection-oriented. The only difference between these types is that record boundaries are maintained using the SOCK_SEQPACKET type. A record can be sent using one or more output operations and received using one or more input operations, but a single operation never transfers parts of more than one record. Record boundaries are visible to the receiver via the MSG_EOR flag in the received message flags returned by the recvmsg() function. -- POSIX standard
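For reference, the SCTP behaviour mentioned above is toggled with a socket option. A sketch (assuming a Linux SCTP socket sd and the lksctp-tools headers; nothing like this exists for the AF_UNIX case):
#include <netinet/sctp.h>   // from lksctp-tools

int on = 1;
if (setsockopt(sd, IPPROTO_SCTP, SCTP_EXPLICIT_EOR, &on, sizeof(on)) == -1) {
    perror("SCTP_EXPLICIT_EOR");
}
// With this enabled, a record ends only when a send carries MSG_EOR;
// without it, every sendmsg() terminates a record on its own.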