What could cause my packet's byte order to become partially scrambled? - c++

I am sending packets over a TCP socket between a Linux CentOS 4 machine and a Windows XP machine running Interix with Gentoo. When the packet is received on the Interix side, about 10% of the characters are consistently scrambled, at the exact same offsets from the beginning of the packet. On the sending Linux side, the packet has these correct contents:
-----BEGIN PUBLIC KEY-----
MIIBojCCARcGByqGSM4+AgEwggEKAoGBAP//////////yQ/aoiFowjTExmKLgNwc
^ ^^^^^^^^^^^^^
0SkCTgiKZ8x0Agu+pjsTmyJRSgh5jjQE3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHC
^^^^^^^^
ReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/taiZ+lrp8kEXxLH+ZJKGZR7OZT
gf//////////AgECAoGAf//////////kh+1RELRhGmJjMUXAbg5olIEnBEUz5joB
Bd9THYnNkSilBDzHGgJu98qM2eadIY2YFYU2+S+KG6fwmra2qOEi8kLauzEvP2N6
JiF00xv2tYX/rlt6A1v29xw1/a1Ez9LXT5IIviWP8ySUMyj2cynA//////////8D
gYQAAoGAKcjWmS+h/a6xY6HfNeVBk+vU4ZQoi4ROBT8NXdiFQUeLwT/WpE/8oAxn
KCOssVcoF54bF8JlEL0McWjQUzMrqoQedizALRRdH7kTUM/yqZZdxLgRFmiFDUXT
XxsFFB5hlLpMqy9lqpNMN8+e5m9ISgu8zHMlTBQXsnwds0VkbeU=
-----END PUBLIC KEY-----
But on Interix, the packet contents are slightly scrambled (but the majority is correct):
-----BEGIN PUBLIC KEY-----
MIIBojCCARcGByqGSM4+AgEwggEKAoGBAP//////y////iFowjTExQ/aomKLgNwc
^ ^^^^^^^^^^^^^
KigTCkS0Z8x0Agu+pjsTmyJRSgh5jjQE3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHC
^^^^^^^^
ReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/taiZ+lrp8kEXxLH+ZJKGZR7OZT
gf//////////AgECAoGAf//////////kh+1RELRhGmJjMUXAbg5olIEnBEUz5joB
Bd9THYnNkSilBDzHGgJu98qM2eadIY2YFYU2+S+KG6fwmra2qOEi8kLauzEvP2N6
JiF00xv2tYX/rlt6A1v29xw1/a1Ez9LXT5IIviWP8ySUMyj2cynA//////////8D
gYQAAoGAKcjWmS+h/a6xY6HfNeVBk+vU4ZQoi4ROBT8NXdiFQUeLwT/WpE/8oAxn
KCOssVcoF54bF8JlEL0McWjQUzMrqoQedizALRRdH7kTUM/yqZZdxLgRFmiFDUXT
XxsFFB5hlLpMqy9lqpNMN8+e5m9ISgu8zHMlTBQXsnwds0VkbeU=
-----END PUBLIC KEY-----
I've pointed to the differences with the ^ characters above. There could be a couple more differing characters around the y, since the repeated / characters would hide any additional characters that were moved within that section.
This code works fine between several platform pairs:
Linux and Linux
Linux and BSD
Linux and Cygwin
Could this be a bug in the Interix and Gentoo code? I'm running on Windows XP, Interix v3.5. I notice that all the right characters are present, but their order is consistently scrambled: portions are reversed, others are cut and reinserted in a different place. The packet is being read on the receiving side with ::read() on the TCP socket file descriptor. There is a lot of code in play here, so I'm not sure which portions would be most relevant to include, but I will try to add more code if specific requests are made.
const int fd;        // Passed in by caller.
char *buf;           // Passed in by caller.
size_t want = count; // This value is 625 for the packet in question.
// As ::read() is called, got is adjusted, until the whole packet is read.
size_t got = 0;
while (got < want) {
    // We call ::select() to ensure bytes are available before calling ::read().
    ssize_t result = ::read(fd, buf, want - got);
    if (result < 0) {
        // Handle error (not getting called, so omitted).
    } else {
        if (result != 0) {
            // We are coming in here in one try and got is set to 625, the amount we want...
            // Not an error: increment the byte counter 'got' and the read pointer 'buf'.
            got += result;
            buf += result;
        } else { // EOF because zero result from read.
            eof = true; // Connection reset by peer.
            break;
        }
    }
}
What experiments might I perform to help nail down where the error is coming from?
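For example, one experiment would be to hex-dump the buffer immediately before ::write() on the sender and immediately after the read loop on the receiver, then diff the two dumps to see whether the reordering happens before the data reaches the socket or only after it is read back. A throwaway helper along these lines (hexDump is just an illustrative name):
#include <cstddef>
#include <cstdio>

// Dump a buffer as hex, 16 bytes per line, so the sender's and receiver's
// copies of the packet can be compared byte for byte.
static void hexDump(const char *tag, const unsigned char *buf, size_t len)
{
    std::printf("%s (%zu bytes):", tag, len);
    for (size_t i = 0; i < len; ++i) {
        if (i % 16 == 0)
            std::printf("\n%04zx ", i);
        std::printf(" %02x", buf[i]);
    }
    std::printf("\n");
}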

I would say you have a concurrency bug on 'buf', or possibly a double free() or a use after free().

Mystery solved! The issue was that off_t was 32 bits wide on the Windows XP machine and 64 bits wide on the CentOS machine. When the packet is sent, its memory layout, which includes some off_t objects, is converted from host to network byte order (little endian to big endian); when the Windows machine receives the packet, it converts back from network to host order. Because the memory layouts differed, I got the scrambling seen above.
I resolved the issue by using my own soff_t, which is 64 bits wide, everywhere.
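In sketch form, the fix amounts to something like the following (soff_t is the name from my code; the hton64 helper is illustrative, since there is no standard htonll on these platforms, and as written it assumes a little-endian host like the machines involved here):
#include <stdint.h>
#include <arpa/inet.h>   // htonl

// Fixed-width replacement for off_t so both ends agree on 8 bytes.
typedef int64_t soff_t;

// Illustrative 64-bit host-to-network conversion: convert each 32-bit half
// with htonl and swap the halves. Correct only on a little-endian host; a
// portable version would test the byte order first.
static inline uint64_t hton64(uint64_t v)
{
    const uint64_t hi = htonl(static_cast<uint32_t>(v >> 32));
    const uint64_t lo = htonl(static_cast<uint32_t>(v & 0xFFFFFFFFu));
    return (lo << 32) | hi;
}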
However, I then ran into another issue where the compiler did not pack the structure the same way on both machines: on Windows it inserted 4 bytes to make the long long 8-byte aligned, whereas on CentOS it did not:
typedef struct Option
{
    char      _otherStuff[56];
    int       _cpuFreq;
    int       _bufSize;
    soff_t    _fileSize;   // Original bug fixed by forcing these 8 bytes wide.
    soff_t    _seekTo;     // Original bug fixed by forcing these 8 bytes wide.
    int       _optionBits;
    int       _padding;    // To fix the next bug, I added these 4 bytes.
    long long _mtime;
    long long _mode;
} __attribute__ ((aligned(1), packed)) Option;
I had used __attribute__ ((aligned(1), packed)) to force the packing to be consistent and dense, but on Windows XP this was not, or could not be, honored. I solved this by adding _padding to force the next 8-byte member to be 8-byte aligned on CentOS and thus agree with Windows XP.
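To catch this class of mismatch at compile time rather than on the wire, a layout check along these lines can be added on every platform that builds the struct (a sketch using C++11 static_assert; the expected numbers follow from the layout above, assuming 4-byte int and 8-byte soff_t/long long):
#include <cstddef>   // offsetof

// If either compiler lays the struct out differently, the build fails
// instead of the wire format silently diverging.
static_assert(offsetof(Option, _fileSize)   == 64,  "unexpected _fileSize offset");
static_assert(offsetof(Option, _optionBits) == 80,  "unexpected _optionBits offset");
static_assert(offsetof(Option, _mtime)      == 88,  "unexpected _mtime offset");
static_assert(sizeof(Option)                == 104, "unexpected Option size");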

Related

Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.
On the ARM CPU of the Tegra board, a minimal Linux installation is running.
I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area.
The distinct register files and the memory block are all assigned addresses which are aligned and do not overlap with regards to 4KB memory pages.
I explicitly map whole pages with mmap, using the result of getpagesize(), which also is 4096.
I can read/write from/to those exposed registers just fine.
I can read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine. I.e. read contents are correct.
But if I use std::memcpy on the same address range, the Tegra CPU always freezes. I do not see any error message; if GDB is attached, I also don't see anything in Eclipse when trying to step over the memcpy line: it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.
This is a debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise; I did not try that explicitly.
Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?
EDIT: Another effect has been observed: the original code snippet was missing a "vital" printf in the copying for loop, placed before the memory read. With that removed, I don't get back valid data. I have now updated the code snippet to have an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.
Here are the (I think) important excerpts of what's going on, with minor modifications so they make sense in this "de-fluffed" form.
// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped
//--------------------------------
// doing the memory mapping
//
const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );
const uint32_t physAddrNum = (uint32_t) physicalAddr;
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);
m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
// addr passed as null means: the kernel chooses where to place the mapping. Supplying a non-null addr would mean Linux takes it as a "hint" for where to place it.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
    m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}
//--------------------------------
// Accessing the mapped memory
//
//void* m_rawData: <== m_userSpaceMappedAddr
//uint32_t* destination: points to a stack object
//int length: size in 32bit words of the stack object (a struct with only U32's in it)
// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );
// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i = 0; i < length; ++i)
{
    // This extra read makes the data gotten in the 2nd read below valid.
    // Commented out, the data read into destination will not be valid.
    uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
    (void)tmp; // pacify compiler
    destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}
Based on the description, it looks like your FPGA code is not responding correctly to load instructions that read from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I had this problem too when debugging my PCIe logic on an FPGA.
Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.
Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic must respond. For example, a 64-bit load may be completed as two TLPs, with 32 bits of data in the first 128-bit completion TLP and the next 32 bits in the second 128-bit completion TLP.
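If the word-by-word loop is the access pattern your FPGA logic actually supports, it may be worth wrapping it in a helper so nothing (a stray memcpy, or the compiler merging accesses) issues wider loads. A minimal sketch, assuming aligned 32-bit reads from the region are safe as described:
#include <cstdint>
#include <cstddef>

// Copy 'words' 32-bit values out of a memory-mapped I/O region using only
// aligned 32-bit loads; the volatile source keeps the compiler from merging
// or widening the accesses the way memcpy may.
static void copyFromMmio32(uint32_t *dst, const volatile uint32_t *src, std::size_t words)
{
    for (std::size_t i = 0; i < words; ++i)
        dst[i] = src[i];
}

// e.g. copyFromMmio32(destination, static_cast<const volatile uint32_t*>(m_rawData), length);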
What I found super-useful was to add logic to log all the PCIE transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIE TLP per line. It even has documentation.
When the PCIE interface is not working well enough, I stream the log to a UART in hex which can be decoded by pcieflat.
This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.
Alternatively, if you have an integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to the PCIE protocol.

c/c++ libnetfilter_queue and application layer packets selection

I've got a c++ program using libnetfilter_queue library, designed to work on a Linux system.
I'd need to filter only application-layer packets (that is, packets that include a payload for the application layer of the host).
I know that it's not possible with iptables without rebuilding the kernel.
Since I can't do that on the final host device, I'm working from my c++ program.
My aim is to directly accept non-application layer packets and to process layer-7 packets.
I tried using the nfq_get_payload function, which returns -1 if an error (hence, I suppose, no payload) is found.
ret = nfq_get_payload(tb, &data);
if (ret < 0) { /* accept packet */ }
else { /* process packet */ }
I know that the nfq_get_payload function depends on the "adopted mode" (see the nfq_set_mode function), but it is not working for me.
How can I discriminate between application layer packets and "lower-layers" ones?
Knowing that the ip_src byte location is at data + 12 (see also here), and given that the TCP layer size is 64 bytes, the payload should be found, if available, at position data + 12 + 64.
unsigned char* pkt_payload = (buffer + 12 + 64);
Nevertheless, if I try to print the pkt_payload variable, it is not compliant with the expected results.
How can I solve it?
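(For reference, the IPv4 and TCP header lengths are variable and are carried in the IHL and data-offset fields rather than being fixed. A sketch of a length-aware payload lookup, assuming data points at the IPv4 header as returned by nfq_get_payload and that the packet is TCP:)
#include <netinet/in.h>    // IPPROTO_TCP
#include <netinet/ip.h>    // struct iphdr
#include <netinet/tcp.h>   // struct tcphdr

// Returns a pointer to the application-layer (TCP) payload, or NULL when the
// packet carries no application data. 'len' is the value returned by
// nfq_get_payload().
static unsigned char *tcp_payload(unsigned char *data, int len)
{
    struct iphdr *ip = (struct iphdr *) data;
    if (ip->protocol != IPPROTO_TCP)
        return NULL;                              // not TCP: accept as-is

    const int ip_hdr_len = ip->ihl * 4;           // IHL is in 32-bit words
    struct tcphdr *tcp = (struct tcphdr *) (data + ip_hdr_len);
    const int tcp_hdr_len = tcp->doff * 4;        // data offset, also in words

    if (len <= ip_hdr_len + tcp_hdr_len)
        return NULL;                              // headers only, no payload

    return data + ip_hdr_len + tcp_hdr_len;
}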

hidapi: Sending packet smaller than caps.OutputReportByteLength

I am working with a device (the wiimote) that takes commands through the DATA pipe, and only accepts command packets that are EXACTLY as long as the command itself. For example, it will accept:
0x11 0x10
but it will not accept:
0x11 0x10 0x00 0x00 0x00 ... etc.
This is a problem on Windows, as WriteFile() on Windows requires that the byte[] passed to it be at least as long as caps.OutputReportByteLength. On Mac, where this limitation isn't present, my code works correctly. Here is the code from hid.c that causes this issue:
/* Make sure the right number of bytes are passed to WriteFile. Windows
expects the number of bytes which are in the _longest_ report (plus
one for the report number) bytes even if the data is a report
which is shorter than that. Windows gives us this value in
caps.OutputReportByteLength. If a user passes in fewer bytes than this,
create a temporary buffer which is the proper size. */
if (length >= dev->output_report_length) {
    /* The user passed the right number of bytes. Use the buffer as-is. */
    buf = (unsigned char *) data;
} else {
    /* Create a temporary buffer and copy the user's data
       into it, padding the rest with zeros. */
    buf = (unsigned char *) malloc(dev->output_report_length);
    memcpy(buf, data, length);
    memset(buf + length, 0, dev->output_report_length - length);
    length = dev->output_report_length;
}
res = WriteFile(dev->device_handle, buf, length, NULL, &ol);
Removing the above code, as mentioned in the comments, results in an error from WriteFile().
Is there any way that I can pass data to the device of arbitrary size? Thanks in advance for any assistance.
Solved. I used a solution similar to the one used by the folks over at Dolphin, a Wii emulator. Apparently, on the Microsoft Bluetooth stack, WriteFile() doesn't work correctly, causing the Wiimote to return an error. By using HidD_SetOutputReport() on the MS stack and WriteFile() on the BlueSoleil stack, I was able to successfully connect to the device (at least on my machine).
I haven't tested this on the BlueSoleil stack, but Dolphin uses this method, so it is safe to say it works.
Here is a gist containing an ugly implementation of this fix:
https://gist.github.com/Flafla2/d261a156ea2e3e3c1e5c
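In rough outline, the write path ends up looking like the following (a simplified sketch, not the gist's exact code; it assumes the device handle was opened without FILE_FLAG_OVERLAPPED and that report already starts with the report ID byte):
#include <windows.h>
#include <hidsdi.h>   // HidD_SetOutputReport; link against hid.lib

// Try the normal WriteFile path first (fine on BlueSoleil); if it fails,
// fall back to HidD_SetOutputReport, which the Microsoft Bluetooth stack
// accepts for the Wiimote.
static BOOL send_output_report(HANDLE device, unsigned char *report, size_t length)
{
    DWORD written = 0;
    if (WriteFile(device, report, (DWORD) length, &written, NULL))
        return TRUE;
    return HidD_SetOutputReport(device, report, (ULONG) length);
}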

Problems with endianness on Raspberry Pi

I've just started on some raw network programming in C++ and have been compiling on my Raspberry Pi itself (no cross-compiling). That makes everything little endian.
After constructing my IP header, I calculate the IP checksum, but it always came out incorrect (based on an example here: http://www.thegeekstuff.com/2012/05/ip-header-checksum/).
Revving up gdb, I've worked my issue down to the ordering of the first 32 bits in the IP header. The example uses 0x4500003C, which means version 4 (0x4), IHL 5 (0x5), TOS 0 (0x00), and tot_length 60 (0x003C). So I set my packet up the same.
struct iphdr* ip; // Also some mallocing
ip->version = 4;
ip->ihl = 5;
ip->tos = 0;
ip->tot_len = 60;
Now in gdb, I examined the first 32 bits, expecting 0x3C000045 because of endianness, but instead I get this:
(gdb) print ip
$1 = (iphdr *) 0x11018
(gdb) x/1xw 0x11018
0x11018: 0x003c0045
The first 16 bits are in little endian (0x0045), but the second 16 bits, containing decimal 60, seem to be in big endian (0x003C)!
What is giving this? Am I crazy? Am I completely wrong about byte order inside structs? (It's a definite possibility)
There's the order of fields within the struct, and then there's the order of bytes within a multibyte field.
0x003C isn't endian at all, it's the hex value for 60. Sure, it's stored in memory with some endianness, but the order you used to write the field and the order you used to read it back out are the same -- both are the native byte order of the Raspberry Pi, and they cancel out.
Typically you will want to write:
ip->tot_len = htons(60);
when storing a 16-bit field into a packet. There's also htonl for 32-bit fields, and ntohs and ntohl for reading fields from network packets.
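For instance, a header fill with the multi-byte fields converted explicitly looks roughly like this (a sketch; the field names are those of Linux's struct iphdr, and the addresses and values are just examples):
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/ip.h>   // struct iphdr
#include <arpa/inet.h>    // htons, inet_addr
#include <cstring>

void fill_header(struct iphdr *ip)
{
    std::memset(ip, 0, sizeof(*ip));
    ip->version  = 4;                         // bit-fields: no byte-order conversion needed
    ip->ihl      = 5;
    ip->tos      = 0;
    ip->tot_len  = htons(60);                 // 16-bit field: host -> network order
    ip->id       = htons(0x1234);             // example value
    ip->ttl      = 64;
    ip->protocol = IPPROTO_TCP;
    ip->saddr    = inet_addr("10.0.0.1");     // inet_addr already returns network byte order
    ip->daddr    = inet_addr("10.0.0.2");
    // ip->check is computed last, over the header as laid out in network byte order.
}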
The ARM architecture can run in both little-endian and big-endian modes, but the Android platform runs little endian.

Arduino Ethernet Byte size problem

I'm using an Arduino (Duemilanove) with the official Ethernet shield to send data to the controller for controlling an LED matrix. I am trying to send some raw 32-bit unsigned int values (Unix timestamps) to the controller by taking the 4 bytes of the 32-bit value on the desktop and sending them to the Arduino as 4 consecutive bytes. However, whenever a byte value is larger than 127, the value returned by the Ethernet client library is 63.
The following is a basic example of what I'm doing on the arduino side of things. Some things have been removed for neatness.
byte buffer[32];
memset(buffer, 0, 32);
int data;
int i=0;
data = client.read();
while (data != -1 && i < 32)
{
    buffer[i++] = (byte)data;
    data = client.read();
}
So, whenever the input byte is bigger than 127, the variable "data" ends up getting set to 63! At first I thought the problem was further down the line (buffer used to be char instead of byte), but when I print out "data" right after the read, it's still 63.
Any ideas what could be causing this? I know client.read() is supposed to return an int and internally reads data from the socket as uint8_t, which is a full unsigned byte, so I should be able to get at least up to 255...
EDIT: Right you are, Hans. Didn't realize that Encoding.ASCII.GetBytes only supported the first 7 bits and not all 8.
I'm more inclined to suspect the transmit side. Are you positive the transmit side is working correctly? Have you verified with a wireshark capture or some such?
63 is the ASCII code for '?'. That value is relevant: ASCII doesn't have character codes for values over 127, and an ASCII encoder commonly replaces invalid codes like this with a question mark. That's the default behavior of the .NET Encoding.ASCII encoder, for example.
It isn't exactly clear where that might happen. Definitely not in your snippet. Probably on the other end of the wire. Write bytes, not characters.
+1 for Hans Passant and Karl Bielefeldt.
Can you just send the data without encoding? How is the data being sent? TCP/UDP/IP/Ethernet definitely support sending binary data without restriction. If that isn't possible, perhaps converting the data to hex will solve the problem. Base64 would also work (better) but is considerably more work. For small amounts of data, hex is probably the easiest and fastest solution.
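To illustrate sending the raw value without any text encoding, splitting and reassembling the 32-bit timestamp can look like this (a sketch; using big-endian/network order on the wire is an assumption both ends simply have to agree on):
#include <stdint.h>

// Desktop side: split a 32-bit timestamp into 4 raw bytes (big-endian here).
void encode_u32(uint32_t value, uint8_t out[4])
{
    out[0] = (value >> 24) & 0xFF;
    out[1] = (value >> 16) & 0xFF;
    out[2] = (value >> 8)  & 0xFF;
    out[3] = value & 0xFF;
}

// Arduino side: reassemble the 4 bytes collected from client.read().
uint32_t decode_u32(const uint8_t in[4])
{
    return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16) |
           ((uint32_t) in[2] << 8)  |  (uint32_t) in[3];
}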
+1 again to Karl and Ben for mentioning wireshark. Invaluable for debugging network problems like this.