pcap_next call fills in pcap_pkthdr with len equal to zero - c++

I'm using libpcap version 1.1.1 built as a static library (libpcap.a). When I try to execute the following block of code on RHEL 6 64-bit (the executable module itself is built as a 32-bit ELF image), I get a segmentation fault:
const unsigned char* packet;
pcap_pkthdr pcap_header = {0};
unsigned short ether_type = 0;
while ( ether_type != ntohs( 0x800 ) )
{
    packet = pcap_next( m_pcap_handle, &pcap_header );
    if ( packet != NULL )
    {
        memcpy( &ether_type, &packet[12], 2 );
    }
    else
    {
        /* Sleep call goes here */
    }
}
if ( raw_buff->data_len >= pcap_header.caplen )
{
    memcpy( raw_buff->data, &packet[14], pcap_header.len - 14 );
    raw_buff->data_len = pcap_header.len - 14;
    raw_buff->timestamp = pcap_header.ts;
}
A bit of investigation revealed that the pcap_header.len field is equal to zero upon pcap_next's return. The caplen field, in fact, seems to reflect the packet size correctly, and if I dump memory from the packet address, the data appears valid. But I know a len field of zero is invalid; it is supposed to be at least as large as caplen. Is it a bug? What steps should I take to get this fixed?
GDB shows pcap_header contents as:
(gdb) p pcap_header
$1 = {ts = {tv_sec = 5242946, tv_usec = 1361456997}, caplen = 66, len = 0}
Is there perhaps a workaround I can apply? I don't want to upgrade the libpcap version.

Kernels prior to 2.6.27 do not support running 32-bit binaries that use libpcap 1.0 or later on a 64-bit kernel.
libpcap 1.0 and later use the "memory-mapped" capture mechanism on Linux kernels that have it available, and the first version of that mechanism did not ensure that the data structures shared between the kernel and code using the "memory-mapped" capture mechanism were laid out in memory the same way in 32-bit and 64-bit mode.
2.6 kernels prior to the 2.6.27 kernel have only the first version of that mechanism. The 2.6.27 kernel has the second version of that mechanism, which does ensure that the data structures are laid out in memory the same way in 32-bit and 64-bit mode, so that 32-bit user-mode code works the same atop 32-bit and 64-bit kernels.
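If you cannot upgrade the kernel or libpcap, a defensive guard on the application side can at least stop the crash (a sketch, not a fix for the underlying bug; it assumes caplen stays valid, as your gdb dump suggests):
bpf_u_int32 copy_len = pcap_header.caplen;           // bytes actually present in 'packet'
if ( pcap_header.len != 0 && pcap_header.len < copy_len )
    copy_len = pcap_header.len;                      // len should be >= caplen; zero means it is bogus

if ( copy_len > 14 && raw_buff->data_len >= copy_len - 14 )
{
    memcpy( raw_buff->data, &packet[14], copy_len - 14 );
    raw_buff->data_len = copy_len - 14;
    raw_buff->timestamp = pcap_header.ts;
}
With pcap_header.len equal to zero, the original pcap_header.len - 14 wraps around to a huge unsigned value, which is what makes the memcpy fault.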

For the record, I googled and found the defect description at https://bugzilla.redhat.com/show_bug.cgi?id=557728, and it seems it is still relevant nowadays. The problem went away when I linked my application against the shared-library version of libpcap instead of the static one. The system then links my app at runtime against the libpcap shipped with RHEL.
Sincerely yours, Alexander Chernyaev.

Related

Shared memory "Too many open files" but ipcs doesn't show many allocations

I'm writing unit tests for code which creates shared memory.
I only have a couple of tests. I make 4 allocations of shared memory and then it fails on the fifth.
After calling shmat(), perror() says "Too many open files":
template <typename T>
bool Attach(T** ptr, const key_type& key)
{
    // shmemId was 262151
    int32_t shmemId = shmget( key.key(), (size_t) 0, 0644 );
    if ( shmemId < 0 )
    {
        perror( "Error: " );
        return false;
    }
    *ptr = (T*) shmat( shmemId, 0, 0 );
    if ( (int64_t) *ptr < 0 )
    {
        // Problem is here. perror() says 'Too many open files'
        perror( "Error: " );
        return false;
    }
    return true;
}
However, when I check ipcs -m -p I only have a couple of shared memory allocations.
T ID KEY MODE OWNER GROUP CPID LPID
Shared Memory:
m 262151 0x0000a028 --rw-r--r-- 3229 0
m 262152 0x0000a029 --rw-r--r-- 3229 0
In addition, when I check my OS shared memory limits sysctl -A | grep shm I get:
kern.sysv.shmall: 1024
kern.sysv.shmmax: 4194304
kern.sysv.shmmin: 1
kern.sysv.shmmni: 32
kern.sysv.shmseg: 8
security.mac.posixshm_enforce: 1
security.mac.sysvshm_enforce: 1
Are these variables large enough/are they the cause/what values should I have?
I'm sure I edited the file to increase them and restarted the machine, but perhaps it hasn't taken effect (this is on Mac/OSX).
Your problem may be elsewhere.
Edit: This may be a shmmni limit of macOS. See below.
When I run your [simplified] code on my system (linux), the shmget fails.
You didn't specify IPC_CREAT to the third argument. If another process has created the segment, this may be okay.
But, it doesn't/shouldn't like a size of 0. The [linux] man page states that it returns an error (errno set to EINVAL) if the size is less than SHMMIN (which is 1).
That is what happened on my system. So, I adjusted the code to use a size of 1.
This was done [as I mentioned] on linux.
macOS may allow a size of 0, even if that doesn't make practical sense. (e.g.) It may round it up to a page size.
For shmat, the error return is (void *) -1.
But, some systems can have valid addresses that have the high bit set. (e.g.) 0xFFE0000000000000 is a valid address, but would fail your if test because casting that to int64_t will test negative.
Better to do:
if ((int64_t) *ptr == (int64_t) -1)
Or [possibly better]:
if ((void *) *ptr == (void *) -1)
Note that errno is not set/changed if the call succeeds.
To verify this, do: errno = 0 before the shmat call. If perror says "Success", then the shmat is okay. And, your current test needs to be adjusted as above--I'd do that change regardless.
You could also do (e.g):
printf("ptr=%p\n",*ptr);
Normally, errno starts as 0.
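Putting the above adjustments together, the attach routine might look like this (a sketch based on my suggestions; key_type comes from your code, and the size of 1 follows the adjustment I made for my test):
#include <cerrno>
#include <cstdio>
#include <sys/ipc.h>
#include <sys/shm.h>

template <typename T>
bool Attach(T** ptr, const key_type& key)
{
    // Size of 1 rather than 0 (linux rejects sizes below SHMMIN when creating).
    int shmemId = shmget( key.key(), 1, 0644 );
    if ( shmemId < 0 )
    {
        perror( "shmget" );
        return false;
    }

    errno = 0;                          // so a later perror is meaningful
    *ptr = (T*) shmat( shmemId, 0, 0 );
    if ( *ptr == (T*) -1 )              // compare against -1, not "< 0"
    {
        perror( "shmat" );
        return false;
    }
    return true;
}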
Note that there are some differences between macOS and linux.
So, if errno is ever set to "too many open files", this can be because the process has too many open files (EMFILE).
It might be because the system-wide limit is reached (ENFILE) but that is "file table overflow".
Note that under linux shmat can not generate EMFILE. However, it appears that under macOS it can.
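If you want to tell those two apart in code, a fragment for the error path of your Attach (after the shmat call) might be:
if ( *ptr == (T*) -1 )
{
    if ( errno == EMFILE )
        fprintf( stderr, "shmat: per-process limit reached\n" );
    else if ( errno == ENFILE )
        fprintf( stderr, "shmat: system-wide file table overflow\n" );
    else
        perror( "shmat" );
    return false;
}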
However, if the number of calls to shmat is limited [as you mention], the shmat should succeed.
The macOS man page is a little vague as to what the limit is based on. However, I checked the FreeBSD man page for shmat and that says it is limited by the sysctl parameter: kern.ipc.shmseg. Your grep should have caught that [if applicable].
It is possible some other syscall elsewhere in the code is opening too many files. And, that syscall is not checking the error return.
Again, I realize you're running macOS.
But, if available, you may want to try your program under linux. For example, it has much larger limits from the sysctl:
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
vm.hugetlb_shm_group = 0
Note that shmmni is the system-wide maximum number of shared memory segments.
Note that for macOS, shmmni is 32 (vs. 4096 for linux)!?!?
That means that the entire system can only have 32 open shared memory segments for any/all processes???
That seems very low. You can probably set this to a larger number and see if that helps.
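For example, sudo sysctl -w kern.sysv.shmmni=64 may work, though on many macOS releases the kern.sysv.* values only take effect when set at boot (e.g. via /etc/sysctl.conf), so a reboot may be needed for the change to stick.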
Linux has the strace program and you could use it to monitor the syscalls.
But, macOS has dtruss: How to trace system calls of a program in Mac OS X?

Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.
On the ARM CPU of the Tegra board, a minimal Linux installation is running.
I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area.
The distinct register files and the memory block are all assigned addresses which are aligned and do not overlap with regards to 4KB memory pages.
I explicitly map whole pages with mmap, using the result of getpagesize(), which also is 4096.
I can read/write from/to those exposed registers just fine.
I can read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine; the read contents are correct.
But if I use std::memcpy on the same address range, the Tegra CPU always freezes. I do not see any error message; with GDB attached, I also don't see a thing in Eclipse when trying to step over the memcpy line: it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.
This is a debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise, but I have not tried that explicitly.
Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?
EDIT: Another effect has been observed: the original code snippet was missing a "vital" printf in the copying for loop, which came before the memory read. With that removed, I don't get back valid data. I have now updated the code snippet to do an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.
Here are the (I think) important excerpts of what's going on, with minor modifications so that it makes sense in this "de-fluffed" form.
// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped

//--------------------------------
// doing the memory mapping
//
const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );

const uint32_t physAddrNum        = (uint32_t) physicalAddr;
const uint32_t offsetInPage       = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx  = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount   = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize         = mappedPagesCount * pageSize;
const off_t    targetOffset       = physAddrNum & ~(off_t)(pageSize - 1);

m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
// addr passed as null means: we supply no address and Linux picks one. A non-null addr would be taken by Linux as a "hint" where to place the mapping.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
    m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}

//--------------------------------
// Accessing the mapped memory
//
// void* m_rawData: <== m_userSpaceMappedAddr
// uint32_t* destination: points to a stack object
// int length: size in 32-bit words of the stack object (a struct with only U32's in it)

// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );

// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i = 0; i < length; ++i)
{
    // This extra read makes the data gotten in the 2nd read below valid.
    // Commented out, the data read into destination will not be valid.
    uint32_t tmp = ((const volatile uint32_t*) m_rawData)[i];
    (void) tmp; // pacify compiler
    destination[i] = ((const volatile uint32_t*) m_rawData)[i];
}
Based on the description, it looks like your FPGA code is not responding correctly to load instructions that read from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I had this problem too when debugging my PCIE logic on an FPGA.
Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.
Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic responds. For example, a 64-bit load may have to be answered with two TLPs: 32 bits of the response in the first 128-bit completion TLP and the next 32 bits in the second 128-bit completion TLP.
What I found super-useful was to add logic to log all the PCIE transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIE TLP per line. It even has documentation.
When the PCIE interface is not working well enough, I stream the log to a UART in hex which can be decoded by pcieflat.
This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.
Alternatively, if you have integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to PCIE protocol.
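If fixing the FPGA logic is not immediately possible, staying with explicit 32-bit accesses is a reasonable interim measure. A sketch of your loop wrapped into a helper, so memcpy is never applied to the mapped region (this does not explain the extra-read quirk, which still points at a device-side bug):
#include <cstddef>
#include <cstdint>

// Copy out of a memory-mapped FPGA region using only aligned 32-bit loads,
// so the device only ever sees 32-bit read requests.
// Assumes both pointers are 4-byte aligned and lengthWords counts 32-bit words.
static void copyFromFpga( uint32_t* dst, const volatile void* src, size_t lengthWords )
{
    const volatile uint32_t* s = (const volatile uint32_t*) src;
    for ( size_t i = 0; i < lengthWords; ++i )
        dst[i] = s[i];
}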

c/c++ libnetfilter_queue and application layer packets selection

I've got a c++ program using libnetfilter_queue library, designed to work on a Linux system.
I'd need to filter only application layer packets (hence, packets including payload for the application layer of the host).
I know that it's not possible with iptables, without rebuilding the kernel.
Since I can't do that on the final host device, I'm working from my c++ program.
My aim is to directly accept non-application layer packets and to process layer-7 packets.
I tried using the nfq_get_payload function, which returns -1 if an error (hence, I suppose, no payload) is found.
ret = nfq_get_payload(tb, &data);
if (ret < 0) { /* accept packet */ }
else { /* process packet */ }
I know that the behavior of nfq_get_payload depends on the "adopted mode" (see the nfq_set_mode function), but it is not working for me.
How can I discriminate between application layer packets and "lower-layers" ones?
Knowing that the ip_src byte location is at data + 12 (see also here), and since the TCP layer size is 64 bytes, the payload should be found, if available, at position data + 12 + 64.
unsigned char* pkt_payload = (unsigned char*) (data + 12 + 64);
Nevertheless, if I try to print the pkt_payload contents, they do not match the expected results.
How can I solve it?
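For what it's worth, the fixed 12 + 64 offset is likely the core of the mismatch: neither the IP header nor the TCP header has a fixed size. A sketch of the usual way to derive the payload offset (helper name is illustrative; assumes IPv4/TCP and NFQNL_COPY_PACKET mode, where nfq_get_payload hands back the packet starting at the IP header):
#include <netinet/ip.h>
#include <netinet/tcp.h>

// Returns a pointer to the TCP payload inside 'data', or NULL if there is none.
// 'data' and 'len' are the output pointer and return value of nfq_get_payload.
// A real version should first check that iph->protocol == IPPROTO_TCP.
unsigned char* tcp_payload( unsigned char* data, int len, int* payload_len )
{
    struct iphdr*  iph  = (struct iphdr*) data;
    int ip_hlen = iph->ihl * 4;                      // IHL is in 32-bit words
    struct tcphdr* tcph = (struct tcphdr*) (data + ip_hlen);
    int tcp_hlen = tcph->doff * 4;                   // data offset, 32-bit words
    *payload_len = len - ip_hlen - tcp_hlen;
    return ( *payload_len > 0 ) ? data + ip_hlen + tcp_hlen : NULL;
}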

What could cause my packet's byte order to become partially scrambled?

I am sending packets over a TCP socket between a Linux Centos 4 machine and a Windows XP machine running Interix with Gentoo. When the packet is received by Interix, about 10% of the characters are consistently scrambled, at the exact same offsets from the beginning of the packet. On the sending Linux side, the packet has this correct content:
-----BEGIN PUBLIC KEY-----
MIIBojCCARcGByqGSM4+AgEwggEKAoGBAP//////////yQ/aoiFowjTExmKLgNwc
^ ^^^^^^^^^^^^^
0SkCTgiKZ8x0Agu+pjsTmyJRSgh5jjQE3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHC
^^^^^^^^
ReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/taiZ+lrp8kEXxLH+ZJKGZR7OZT
gf//////////AgECAoGAf//////////kh+1RELRhGmJjMUXAbg5olIEnBEUz5joB
Bd9THYnNkSilBDzHGgJu98qM2eadIY2YFYU2+S+KG6fwmra2qOEi8kLauzEvP2N6
JiF00xv2tYX/rlt6A1v29xw1/a1Ez9LXT5IIviWP8ySUMyj2cynA//////////8D
gYQAAoGAKcjWmS+h/a6xY6HfNeVBk+vU4ZQoi4ROBT8NXdiFQUeLwT/WpE/8oAxn
KCOssVcoF54bF8JlEL0McWjQUzMrqoQedizALRRdH7kTUM/yqZZdxLgRFmiFDUXT
XxsFFB5hlLpMqy9lqpNMN8+e5m9ISgu8zHMlTBQXsnwds0VkbeU=
-----END PUBLIC KEY-----
But on Interix, the packet contents are slightly scrambled (but the majority is correct):
-----BEGIN PUBLIC KEY-----
MIIBojCCARcGByqGSM4+AgEwggEKAoGBAP//////y////iFowjTExQ/aomKLgNwc
^ ^^^^^^^^^^^^^
KigTCkS0Z8x0Agu+pjsTmyJRSgh5jjQE3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHC
^^^^^^^^
ReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/taiZ+lrp8kEXxLH+ZJKGZR7OZT
gf//////////AgECAoGAf//////////kh+1RELRhGmJjMUXAbg5olIEnBEUz5joB
Bd9THYnNkSilBDzHGgJu98qM2eadIY2YFYU2+S+KG6fwmra2qOEi8kLauzEvP2N6
JiF00xv2tYX/rlt6A1v29xw1/a1Ez9LXT5IIviWP8ySUMyj2cynA//////////8D
gYQAAoGAKcjWmS+h/a6xY6HfNeVBk+vU4ZQoi4ROBT8NXdiFQUeLwT/WpE/8oAxn
KCOssVcoF54bF8JlEL0McWjQUzMrqoQedizALRRdH7kTUM/yqZZdxLgRFmiFDUXT
XxsFFB5hlLpMqy9lqpNMN8+e5m9ISgu8zHMlTBQXsnwds0VkbeU=
-----END PUBLIC KEY-----
I've pointed to the differences with the ^ characters above. There could be a couple more changed characters around the y, since the runs of repeated / characters would hide additional characters that moved within that section.
This code works fine between several platform pairs:
Linux and Linux
Linux and BSD
Linux and Cygwin
Could this be a bug in the Interix and Gentoo code? I'm running on Windows XP, Interix v3.5. I notice that all the right characters are present, but their order is consistently scrambled: portions are reversed, others are cut and reinserted in a different place. The packet is being read on the receiving side with ::read() on the TCP socket file descriptor. There is a lot of code in play here, so I'm not sure which portions would be most relevant to include, but I will try to add more code if specific requests are made.
const int fd;         // Passed in by caller.
char *buf;            // Passed in by caller.
size_t want = count;  // This value is 625 for the packet in question.
// As ::read() is called, got is adjusted, until the whole packet is read.
size_t got = 0;
while (got < want) {
    // We call ::select() to ensure bytes are available before calling ::read().
    ssize_t result = ::read(fd, buf, want - got);
    if (result < 0) {
        // Handle error (not getting called, so omitted).
    } else if (result != 0) {
        // We are coming in here in one try and got is set to 625, the amount we want...
        // Not an error: increment the byte counter 'got' and the read pointer 'buf'.
        got += result;
        buf += result;
    } else { // EOF because zero result from read.
        eof = true; // Connection reset by peer.
        break;
    }
}
What experiments might I perform to help nail down where the error is coming from?
I would say you have a concurrency bug on 'buf', or possibly a duplicate free() or a re-use after free().
Mystery solved! The issue was that off_t was 32 bits wide on the Windows XP machine and 64 bits wide on the Centos machine. When the packet is sent, its memory layout, which includes some off_t objects, is converted from host into network byte order (little endian to big endian); when the Windows machine receives the packet, it converts back from network to host order. Because the memory layouts differed, I got the scrambling seen above.
I resolved the issue by using my own soff_t everywhere that is 64 bits wide.
However, I then ran into another issue: the compiler did not pack the structure the same way on both machines. On Windows it inserted 4 bytes to make the long long 8-byte aligned, whereas on Centos it did not:
typedef struct Option
{
    char _otherStuff[56];
    int _cpuFreq;
    int _bufSize;
    soff_t _fileSize;   // Original bug fixed by forcing these 8 bytes wide.
    soff_t _seekTo;     // Original bug fixed by forcing these 8 bytes wide.
    int _optionBits;
    int _padding;       // To fix the next bug, I added these 4 bytes.
    long long _mtime;
    long long _mode;
} __attribute__ ((aligned(1), packed)) Option;
I had used the __attribute__ ((aligned(1), packed)) to force the packing to be consistent and dense, but on Windows XP this was not or could not be honored. I solved this by adding the _padding member to force the next 8-byte field to be 8-byte aligned on Centos and thus agree with Windows XP.
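A sketch of the more robust alternative, which I did not adopt here but is worth noting: serialize each field explicitly into the packet instead of copying struct memory. That sidesteps both the off_t width difference and the packing difference (helper name is illustrative):
#include <stdint.h>

// Append a 64-bit value to the packet in network (big-endian) byte order,
// one byte at a time, so compiler padding and host endianness never matter.
static uint8_t* put_u64( uint8_t* p, uint64_t v )
{
    for ( int shift = 56; shift >= 0; shift -= 8 )
        *p++ = (uint8_t)( v >> shift );
    return p;
}
// Usage: p = put_u64( p, (uint64_t) opt->_fileSize );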

Problems with endianess on Raspberry Pi

I've just started on some raw network programming in C++ and have been compiling on my Raspberry Pi itself (no cross-compiling). That makes everything little endian.
After constructing my IP header, I calculate the IP checksum, but it was always coming out incorrect (based on an example here http://www.thegeekstuff.com/2012/05/ip-header-checksum/).
Revving up gdb, I've narrowed my issue down to the ordering of the first 32 bits of the IP header. The example uses 0x4500003C, which means version 4 (0x4), IHL 5 (0x5), TOS 0 (0x00), and tot_length 60 (0x003C). So I set my packet up the same way:
struct iphdr* ip; // Also some mallocing
ip->version = 4;
ip->ihl = 5;
ip->tos = 0;
ip->tot_len = 60;
Now in gdb, I examined the first 32 bits, expecting 0x3C000045 because of endianness, but instead I get this:
(gdb) print ip
$1 = (iphdr *) 0x11018
(gdb) x/1xw 0x11018
0x11018: 0x003c0045
The first 16 bits are in little endian (0x0045) but the second, containing decimal 60, seem to be in big endian (0x003C)!
What is giving this? Am I crazy? Am I completely wrong about byte order inside structs? (It's a definite possibility)
There's the order of fields within the struct, and then there's the order of bytes within a multibyte field.
0x003C isn't endian at all, it's the hex value for 60. Sure, it's stored in memory with some endianness, but the order you used to write the field and the order you used to read it back out are the same -- both are the native byte order of the Raspberry Pi, and they cancel out.
Typically you will want to write:
ip->tot_len = htons(60);
when storing a 16-bit field into a packet. There's also htonl for 32-bit fields, and ntohs and ntohl for reading fields from network packets.
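A quick self-contained way to see the difference in memory:
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

int main( void )
{
    uint16_t raw  = 60;          // stored in little-endian RAM as: 3C 00
    uint16_t wire = htons( 60 ); // stored in network byte order as: 00 3C
    unsigned char b[4];
    memcpy( b,     &raw,  2 );
    memcpy( b + 2, &wire, 2 );
    printf( "host: %02X %02X  network: %02X %02X\n", b[0], b[1], b[2], b[3] );
    // On the Raspberry Pi this prints: host: 3C 00  network: 00 3C
    return 0;
}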
The ARM architecture can run in both little- and big-endian modes, but the Android platform runs little endian.