Sometimes MPI is used to send low-entropy data in messages, so it can be useful to try to compress messages before sending them. I know that MPI can run over very fast networks (10 Gbit/s and more), but many MPI programs are used on cheap networks such as 0.1 or 1 Gbit/s Ethernet, with cheap (slow, low-bisection-bandwidth) switches. There is a very fast compression algorithm, Snappy (Wikipedia), which has
compression speed of 250 MB/s and decompression speed of 500 MB/s,
so on compressible data and a slow network it will give some speedup.
Is there any MPI library which can compress MPI messages (at the MPI layer; not compression of IP packets as in PPP)?
MPI messages are also structured, so there could be some special methods, like compressing the exponent part in an array of doubles.
PS: There is also the LZ4 compression method, with comparable speed.
I won't swear that there's none out there, but there's none in common use.
There are a couple of reasons why it's not common:
MPI is often used for sending lots of floating point data which is hard (but not impossible) to compress well, and often has relatively high entropy after a while.
In addition, MPI users are often as concerned with latency as bandwidth, and adding a compression/decompression step into the message-passing critical path wouldn't be attractive to those users.
Finally, some operations (like reduction collectives, or scatter/gather) would be very hard to implement efficiently with compression.
However, it sounds like your use case could benefit from this for point-to-point communications, so there's no reason why you couldn't do it yourself. If you were going to send a message of size N and the receiver expected it, then:
the sender calls the compression routine, getting back a buffer and a new length M;
if M >= N, it sends the original data, prefixed with an initial byte of (say) 0, as N+1 bytes to the receiver;
otherwise it sends an initial byte of 1 followed by the compressed data;
the receiver receives the data into a buffer of length N+1;
if the first byte is 1, it calls MPI_Get_count to determine the amount of data received and calls the decompression routine;
otherwise it uses the uncompressed data.
I can't give you much guidance as to the compression routines, but it does look like people have tried this before, e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.7936 .
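In case it's useful, here is a minimal sketch of that scheme for point-to-point messages. compress_buf and decompress_buf are hypothetical wrappers around whatever compressor you pick (Snappy, LZ4, zlib, ...); they are not a real API, and error handling is omitted:

#include <mpi.h>
#include <cstring>
#include <vector>

// Hypothetical wrappers around your chosen compressor:
int compress_buf(const char* in, int n, char* out);              // returns compressed size; assumed to return >= n if the result would not fit in n bytes
int decompress_buf(const char* in, int m, char* out, int cap);   // decompresses m bytes into out (capacity cap)

void send_maybe_compressed(const char* data, int n, int dest, int tag, MPI_Comm comm)
{
    std::vector<char> out(1 + n);
    int m = compress_buf(data, n, out.data() + 1);
    if (m >= n) {                      // compression did not help: send raw, flagged with 0
        out[0] = 0;
        std::memcpy(out.data() + 1, data, n);
        MPI_Send(out.data(), n + 1, MPI_BYTE, dest, tag, comm);
    } else {                           // send compressed, flagged with 1
        out[0] = 1;
        MPI_Send(out.data(), m + 1, MPI_BYTE, dest, tag, comm);
    }
}

void recv_maybe_compressed(char* data, int n, int src, int tag, MPI_Comm comm)
{
    std::vector<char> in(1 + n);
    MPI_Status st;
    MPI_Recv(in.data(), n + 1, MPI_BYTE, src, tag, comm, &st);
    int count = 0;
    MPI_Get_count(&st, MPI_BYTE, &count);       // actual number of bytes received
    if (in[0] == 1)
        decompress_buf(in.data() + 1, count - 1, data, n);
    else
        std::memcpy(data, in.data() + 1, n);
}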
I'll be happy to be told otherwise, but I don't think many of us MPI users are concerned with having a transport layer that compresses data.
Why the heck not?
1) We already design our programs to do as little communication as possible, so we (like to think we) are sending the bare minimum across the interconnect.
2) The bulk of our larger messages comprise arrays of floating-point numbers which are relatively difficult (and therefore relatively expensive in time) to compress to any degree.
There's an ongoing project at the University of Edinburgh: http://link.springer.com/chapter/10.1007%2F978-3-642-32820-6_72?LI=true
I want to save network bandwidth by using compression, such as bzip2 or gzip.
Attackers, as well as normal users, may send compressed messages.
Are there sequences of bytes which will cause some decompression functions to become stuck in an infinite loop, or to use vast amounts of memory?
If so, is this a fundamental property of those algorithms, or just an implementation bug?
I can only speak for zlib's inflate. There is no input that would result in an infinite loop or uncontrolled memory consumption.
Since the maximum compression ratio of deflate is a bit less than 1032:1, inflate, when working normally, can expand data by up to almost 1032:1. You just need to be able to handle that possibility.
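For what it's worth, here is a sketch of how one might cap inflate's output using zlib's documented streaming API, so that even a highly compressed ("bomb") input cannot consume unbounded memory. The 16 KiB chunk size and the max_output limit are arbitrary choices:

#include <zlib.h>
#include <cstring>
#include <vector>

bool inflate_bounded(const Bytef* in, size_t in_len,
                     std::vector<unsigned char>& out, size_t max_output)
{
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));            // zalloc/zfree/opaque = Z_NULL
    if (inflateInit(&strm) != Z_OK)
        return false;

    strm.next_in  = const_cast<Bytef*>(in);
    strm.avail_in = static_cast<uInt>(in_len);

    unsigned char chunk[16384];
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        strm.next_out  = chunk;
        strm.avail_out = sizeof(chunk);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {   // corrupt or truncated input
            inflateEnd(&strm);
            return false;
        }
        size_t produced = sizeof(chunk) - strm.avail_out;
        if (out.size() + produced > max_output) {   // refuse pathological expansion
            inflateEnd(&strm);
            return false;
        }
        out.insert(out.end(), chunk, chunk + produced);
    }
    inflateEnd(&strm);
    return true;
}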
I am getting data from a sensor (camera) and writing the data into a binary file. The problem is that it takes a lot of space on the disk.
So I used the compression from Boost (zlib) and the space used was reduced a lot! The problem is that the compression process is slow, and a lot of data goes missing.
So I want to implement two threads: one getting the data from the camera and writing it into a buffer, and a second taking the data from the front of the buffer and writing it into the binary file. That way, all the data will be present.
How do I implement this buffer? It needs to grow dynamically and support pop_front. Should I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data size to uncompressed data size that your compression algorithm produces (output size divided by input size). (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor in real time, which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (the rate at which compressed sensor data is produced), you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing it, and again you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.).
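As a quick sanity check, you can plug your own measured numbers into these two inequalities (the values below are placeholders, not measurements):

double SP = 40e6;   // sensor produces 40 MB/s          (placeholder)
double SC = 60e6;   // compressor consumes 60 MB/s      (placeholder)
double RC = 0.5;    // 2:1 compression ratio            (placeholder)
double SW = 30e6;   // disk sustains 30 MB/s of writes  (placeholder)

bool compression_keeps_up = (SC >= SP);       // otherwise the buffer grows without bound
bool disk_keeps_up        = (SW >= SP * RC);  // compressed output must fit within the disk rate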
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
1. Read K bytes of sensor data.
2. Put the whole K bytes as a unit on Q1.
3. Go to 1.
Compression Thread:
1. Wait till there is something on Q1.
2. Pop one buffer of data (probably K bytes) from Q1.
3. Compress the buffer into a hopefully smaller buffer and put it on Q2.
4. Go to 1.
Output Thread:
1. Wait till there is something on Q2.
2. Pop one buffer of data from Q2.
3. Write the buffer to the output file.
4. Go to 1.
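Here is a minimal sketch of that pipeline in C++11: a blocking thread-safe queue built on std::deque, plus the three thread bodies. read_sensor, compress and write_file are hypothetical placeholders for your camera read, your zlib/Snappy call and your file write; shutdown and error handling are omitted:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using Buffer = std::vector<char>;

class BufferQueue {
    std::deque<Buffer> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Buffer b) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(b)); }
        cv_.notify_one();
    }
    Buffer pop() {                          // blocks until a buffer is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Buffer b = std::move(q_.front());
        q_.pop_front();
        return b;
    }
};

// Hypothetical hooks -- replace with your camera read, compressor and file writer:
Buffer read_sensor();                       // reads K bytes from the camera
Buffer compress(Buffer in);                 // e.g. zlib/Snappy on one buffer
void   write_file(const Buffer& out);       // appends one buffer to the output file

BufferQueue q1, q2;

void input_thread()    { for (;;) q1.push(read_sensor()); }
void compress_thread() { for (;;) q2.push(compress(q1.pop())); }
void output_thread()   { for (;;) write_file(q2.pop()); }

// In main(): std::thread t1(input_thread), t2(compress_thread), t3(output_thread);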
The most CPU-intensive part of the work is in the second thread; the other two probably don't consume much CPU time and can therefore probably share a CPU core. This means the above strategy may be runnable on two cores. It can also run on a single core if the workload is light, or require many, many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows or AIO/io_uring on Linux), you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
1. Wait till there is something on Q1.
2. Pop one buffer of data (probably K bytes) from Q1.
3. Compress the buffer into a hopefully smaller buffer.
4. Issue an asynchronous write request to the OS to write the compressed buffer out to disk.
5. Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for the various (usually constant-time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run and issuing a write request on a file becomes negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk). This usually means that K should be as large as possible. But if K is very large (many megabytes or hundreds of megabytes), then if your application crashes you'll lose a lot of data. You need to find a balance between performance and the risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10 KiB and 1 MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A plain std::deque or std::list or std::anything won't be usable by itself, but can be used as a good basis for writing a thread-safe queue.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that each call to the compression routine be matched with one call to the decompression routine later on. These might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you can actually decompress and read the data later.
I use Linux 3.x and a modern glibc (2.19).
I would like to send several Ethernet frames without switching between kernel and user space back and forth.
I have MTU = 1500, and I want to send 800 KB.
I initialize the receiver address like this:
struct sockaddr_ll socket_address;
memset(&socket_address, 0, sizeof(socket_address));   // zero the unused fields
socket_address.sll_ifindex = if_idx.ifr_ifindex;       // interface index (from SIOCGIFINDEX)
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...
After that I can call sendto/sendmsg 800 KB / 1500 ≈ 500 times and everything works fine, but this requires user space <-> kernel transitions ~500 * 25 times per second. I want to avoid that.
I tried to init struct msghdr::msg_iov with the appropriate info,
but got the error "message too long"; it looks like msghdr::msg_iov cannot describe anything with size > MTU.
So the question is: is it possible to send many raw Ethernet frames on Linux from user space at once?
PS
I get the data (800 KB) from a file and read it into memory, so struct iovec is good for me: I can create a suitable number of Ethernet headers and use two iovecs per 1500-byte packet, one pointing to the data and one pointing to the Ethernet header.
Whoa.
My last company made realtime hi-def video encoding hardware. In the lab, we had to blast 200 MB/second across a bonded link, so I have some experience with this. What follows is based upon that.
Before you can tune, you must measure. You don't want to do multiple syscalls, but can you prove with timing measurement that the overhead is significant?
I use a wrapper routine around clock_gettime that gives back the time with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".
So, for any given syscall, you need a measurement:
tstart = nanotime();
syscall();
tdif = nanotime() - tstart;
For send/sendto/sendmsg/write, do this with small data so you're sure you're not blocking [or use O_NONBLOCK, if applicable]. This gives you the syscall overhead.
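A minimal sketch of such a nanotime wrapper (assuming clock_gettime is available; CLOCK_MONOTONIC is a good choice for measuring intervals on one machine, while the cross-machine timestamps described below would need CLOCK_REALTIME and reasonably synchronized clocks):

#include <stdint.h>
#include <time.h>

static inline uint64_t nanotime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);                    // monotonic: immune to wall-clock jumps
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}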
Why are you going directly to Ethernet frames? TCP [or UDP] is usually fast enough, and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires Ethernet frames, or whether it was that you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800 KB/s (~1 MB/s) and my project was doing 100x-200x more than that over TCP.
What about using two plain write calls to the socket? One for header, one for data [all 800KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.
Further, why do you need your header to be in a separate buffer? When you allocate your buffer, just do:
datamax = 800 * 1024;  // or whatever
buflen = sizeof(struct header) + datamax;
buf = malloc(buflen);

while (1) {
    datalen = read(fdfile, &buf[sizeof(struct header)], datamax);
    if (datalen <= 0)
        break;  // EOF or read error -- stop sending
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}
This works even for the ethernet frame case.
One of the things you can also do is use setsockopt to increase the size of the kernel buffer for your socket (SO_SNDBUF/SO_RCVBUF). Otherwise, you can send data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
To measure the performance of the wire, add some fields to your header:
u64 send_departure_time; // set by sender from nanotime
u64 recv_arrival_time; // set by receiver when the packet arrives
So, the sender sets the departure time and does the write [just send the header for this test]. Call this packet Xs. The receiver stamps it when it arrives and immediately sends back a message to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender gets this, it stamps it with an arrival time.
With the above we now have:
T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender
Assuming you do this on a relatively quiet connection with little to no other traffic and you know the link speed (e.g. 1 Gb/s), with T1/T2/T3/T4 you can calculate the overhead.
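For example, one common way to combine the four stamps (a sketch; note that T1/T4 and T2/T3 come from different clocks, but each difference is taken within a single clock, so the two clocks only need to be stable, not synchronized):

#include <stdint.h>

// T1..T4 are the nanosecond stamps collected as described above.
double estimate_overhead_ns(uint64_t T1, uint64_t T2, uint64_t T3, uint64_t T4)
{
    double round_trip_ns = (double)((T4 - T1) - (T3 - T2)); // sender-side elapsed minus the receiver's turnaround
    double one_way_ns    = round_trip_ns / 2.0;             // rough one-way estimate
    double wire_ns       = 1538.0 * 8.0;                    // full-size frame at 1 Gb/s: 1 bit per nanosecond
    return one_way_ns - wire_ns;                            // what's left is stack/driver/NIC overhead
}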
You can repeat the measurement for TCP/UDP vs ETH. You may find that it doesn't buy you as much as you think. Once again, can you prove it with precise measurement?
I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100Gb Ethernet NIC card and the NIC does the timestamping in hardware.
One of the other things you may have to do is add some throttle control. This is similar to what bittorrent does or what the PCIe bus does.
When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind write". That is, the sender is free to blast up to that much without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. The sender can add this value back to the blind-write limit and keep going.
For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.
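A rough sketch of the sender side of that credit scheme, with hypothetical transport hooks (have_data, next_chunk_size, wait_for_ack and send_chunk are stand-ins, not a real API):

#include <cstddef>

// Hypothetical hooks -- replace with your own transport:
bool   have_data();
size_t next_chunk_size();
size_t wait_for_ack();            // blocks; returns the number of bytes the receiver reports drained
void   send_chunk(size_t n);

void sender_loop(size_t credits)  // credits starts at the receiver's advertised buffer size
{
    while (have_data()) {
        size_t n = next_chunk_size();
        while (credits < n)
            credits += wait_for_ack();   // throttle until the receiver has freed enough space
        send_chunk(n);
        credits -= n;
    }
}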
UPDATE
Based upon some of the additional information from your comments [the actual system configuration should go, in a more complete form, as an edit at the bottom of your question]:
You do have a need for a raw socket and sending Ethernet frames. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl. You may not need to set the MTU to 800 KB; your NIC has a practical limit. You can probably increase the MTU from 1500 to 15000 easily enough. That would reduce the syscall overhead by 10x, and that may be "good enough".
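A sketch of the SIOCSIFMTU approach on Linux (requires root/CAP_NET_ADMIN; whether a given MTU is accepted depends on the driver and NIC):

#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>

int set_mtu(int sockfd, const char* ifname, int mtu)
{
    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    return ioctl(sockfd, SIOCSIFMTU, &ifr);   // 0 on success, -1 (with errno set) on failure
}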
You probably will have to use sendto/sendmsg. The two write calls were based on converting to TCP/UDP. But I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most of the example code for what you want uses sendto. sendmsg may seem like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
In addition to reducing syscall overhead, a larger MTU might improve the efficiency of the "wire", even though this doesn't seem to be a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I am still puzzled by one of your comments about "not using a wire". The FPGA connected to the Ethernet pins of the CPU I get, sort of. More precisely, do you mean FPGA pins connected to the Ethernet pins of the CPU's NIC card/chip?
Are the CPU and NIC on the same PC board, with the FPGA pins connected via PC board traces? Otherwise, I don't understand "not using a wire".
However, once again, I must say that you must be able to measure your performance before you blindly try to improve it.
Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize for it may not be worth it and doing so may actually hurt performance more severely in other areas that you didn't realize when you started.
As an example, I once worked on a system that had a severe performance problem, such that the system didn't work. I suspected the serial port driver was slow, so I recoded it from a high-level language (like C) into assembler.
I increased the driver performance by 2x, but it contributed less than a 5% performance improvement to the system. It turned out the real problem was that other code was accessing non-existent memory, which just caused a bus timeout, slowing the system down measurably [it did not generate an interrupt that would have made it easy to find, as it would on modern systems].
That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!
Nowadays, I never try large optimization until I can measure first. In some cases, I add an optimization that I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out that the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.
What CPU are you using: x86, ARM, MIPS, etc.? At what clock frequency? How much DRAM? How many cores?
What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted entirely to logic, or do you also have a CPU inside it such as a MicroBlaze, Nios, or ARM core? Does the FPGA have access to DRAM of its own [and how much DRAM]?
If you increase the MTU, can the FPGA handle it, from either a buffer-space standpoint or a clock-speed standpoint? If you increase the MTU, you may need to add an ack/sync protocol as I suggested in the original post.
Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between CPU and FPGA.
This may be mitigated, purely as a side effect of sending small packets. If you increase MTU too much, you might overwhelm the FPGA. In other words, it is the very overhead you're trying to optimize away, that allows the FPGA to keep up with the data rate.
This is what I meant by unintended consequences of blind optimization. It can have unintended and worse side effects.
What is the nature of the data being sent to the FPGA? You're sending 800KB, but how often?
I am assuming that it is not the FPGA firmware itself for a few reasons. You said the firmware was already almost full [and it is receiving the ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. So, am I correct?
You're sending the data to the FPGA from a file. This implies that it is only being sent once, at the startup of your CPU's application. Is that correct? If so, optimization is not needed because it's an init/startup cost that has little impact on the running system.
So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall -- not just the syscall overhead, but the optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy is 64 KB, depending upon the filesystem or underlying disk characteristics.
So, if you're looking to reduce overhead, reading the data from a file may cost considerably more than having the application generate the data [if that's possible].
The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.
You say your system is using a lot of CPU time for other things. Can you measure those other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [quite probably can find] the larger bottlenecks and recode them, and you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.
Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.
When considering performance, it is important to consider it from an overall system standpoint. In your case, this means CPU, FPGA, and anything else in it.
You say that the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they don't that the FPGA is almost out of space, and otherwise you would? Is the FPGA firmware 100% done, or is there more RTL to be written? If you're at 90% space utilization in the FPGA and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, possibly with a higher clock rate.
In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We also used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendor's development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.
So, when an FPGA is almost full of logic, place-and-route can be difficult to achieve. It might be a reason to consider a larger part [if possible].
In terms of low latency (I am thinking about financial exchanges/co-location, people who care about microseconds), what options are there for sending packets from a C++ program between two Unix computers?
I have heard about kernel-bypass network cards, but does this mean you program against some sort of API for the card? I presume this would be a faster option in comparison to using standard Unix Berkeley sockets?
I would really appreciate any contribution, especially from persons who are involved in this area.
EDITED from milliseconds to microseconds
EDITED I am kind of hoping to receive answers based more on C/C++ than on network hardware technologies. It was intended as a software question.
UDP sockets are fast, low latency, and reliable enough when both machines are on the same LAN.
TCP is much slower than UDP, but when the two machines are not on the same LAN, UDP is not reliable.
Software profiling will stomp out obvious problems with your program. However, when you are talking about network performance, network latency is likely to be your largest bottleneck. If you are using TCP, then you want to do things that avoid congestion and loss on your network to prevent retransmissions. There are a few things you can do to cope:
Use a network with bandwidth and reliability guarantees.
Properly size your TCP parameters to maximize utilization without incurring loss.
Use error correction in your data transmission to correct for the small amount of loss you might encounter.
Or you can avoid using TCP altogether. But if reliability is required, you will end up implementing much of what is already in TCP.
But, you can leverage existing projects that have already thought through a lot of these issues. The UDT project is one I am aware of, that seems to be gaining traction.
At some point in the past, I worked with a packet-sending driver that was loaded into the Windows kernel. Using this driver it was possible to generate a stream of packets at something like 10-15 times the rate (I do not remember the exact number) achievable from an app using the sockets layer.
The advantage is simple: the sending request comes directly from the kernel and bypasses multiple layers of software: sockets, the protocol stack (even for UDP packets, simple protocol-driver processing is still needed), context switches, etc.
Usually, reduced latency comes at the cost of reduced robustness. Compare, for example, the (often heavily advertised) fastpath option for ADSL: the reduced latency due to shorter packet transfer times comes at the cost of increased error susceptibility. Similar technologies might exist for a large number of network media, so it very much depends on the hardware technologies involved. Your question suggests you're referring to Ethernet, but it is unclear whether the link is Ethernet-only or something else (ATM, ADSL, …), and whether some other network technology would be an option as well. It also very much depends on geographical distances.
EDIT:
I got a bit carried away with the hardware aspects of this question. To provide at least one aspect tangible at the level of application design: have a look at zero-copy network operations like sendfile(2). They can be used to eliminate one possible cause of latency, although only in cases where the original data comes from some source other than the application's memory.
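As a small illustration, sending a whole file over a connected socket with sendfile(2) might look like this (a sketch, with minimal error reporting):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

bool send_whole_file(int sockfd, int filefd)
{
    struct stat st;
    if (fstat(filefd, &st) != 0)
        return false;
    off_t offset = 0;
    while (offset < st.st_size) {
        // The kernel moves file pages to the socket without copying them through user space.
        ssize_t n = sendfile(sockfd, filefd, &offset, st.st_size - offset);
        if (n <= 0)
            return false;   // error (or would-block on a nonblocking socket)
    }
    return true;
}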
As my day job, I work for a certain stock exchange. The answer below is my own opinion on the software solutions we provide for exactly this kind of high-throughput, low-latency data transfer. It is not intended in any way as a marketing pitch (please, I am a dev). It is just to describe the essential components of the software stack in this kind of solution for fast data (the data could be stock/trading market data or, in general, any data):
1] Physical layer: a Network Interface Card in the case of a TCP-UDP/IP-based Ethernet network, or a very fast / high-bandwidth interface called an InfiniBand Host Channel Adapter. In the IP/Ethernet case the software stack is part of the OS. For InfiniBand, the card manufacturers (Intel, Mellanox) provide their drivers, firmware and an API library against which one has to implement the socket code (even InfiniBand uses its own 'socket-ish' protocol for network communication between two nodes).
2] The next layer above the physical layer is a middleware which basically abstracts the lower network-protocol nitty-gritty and provides some kind of interface for data I/O from the physical layer to the application layer. This layer also provides some kind of network data quality assurance (if using TCP).
3] The last layer is the application which we provide on top of the middleware. Anyone who gets 1] and 2] from us can develop a low-latency/high-throughput 'data transfer over the network' kind of app for stock trading, algorithmic trading and similar applications, using a choice of programming language interfaces: C, C++, Java, C#.
Basically, a client like you can develop his own application in C or C++ using the APIs we provide, which will take care of interacting with the NIC or HCA (i.e. the actual physical network interface) to send and receive data fast, really fast.
We have a comprehensive solution catering to the different quality and latency profiles demanded by our clients: some are fine with microsecond latency but need very high data quality/very few errors; some can tolerate a few errors but need nanosecond latency; some need microsecond latency with no errors tolerable; and so on.
If you need, or are in any way interested in, this kind of solution, ping me offline at the contacts mentioned here on SO.
I am writing C++ socket code and I need some help!
In my program I don't know what the size of the message will be. It may send part of a file or the file itself, and the file may be huge. So should I specify a maximum size for the packet, and split the message into more than one packet if it exceeds that maximum?
It's never constructive to think about "packets" and "messages" when using TCP, because:
The network engine has its own ways of deciding the best segment size
Segment size makes no difference to application code: the receiving TCP is free to coalesce segments before passing data to the receiving process
You should view TCP the way it was designed: as a reliable, in-order byte-stream service. So just write large-enough blocks and the engine and its myriad of rules will take care of it.
The problem is a little vague, but the approach is universal: the transmitter should send an indication of how many bytes the receiver should expect. The receiver should expect to see this indication and then prepare to receive that many bytes.
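A minimal sketch of that length-prefix framing over TCP (a 4-byte network-order length followed by the payload; the helper names are mine, and error handling is minimal):

#include <arpa/inet.h>
#include <sys/socket.h>
#include <cstdint>
#include <vector>

// Send exactly len bytes, looping over partial writes.
bool send_all(int fd, const char* buf, size_t len) {
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n <= 0) return false;
        buf += n; len -= (size_t)n;
    }
    return true;
}

// Receive exactly len bytes, looping over partial reads.
bool recv_all(int fd, char* buf, size_t len) {
    while (len > 0) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n <= 0) return false;
        buf += n; len -= (size_t)n;
    }
    return true;
}

// Frame = 4-byte length in network byte order, then the payload.
bool send_frame(int fd, const char* data, uint32_t len) {
    uint32_t nlen = htonl(len);
    return send_all(fd, reinterpret_cast<const char*>(&nlen), sizeof(nlen))
        && send_all(fd, data, len);
}

bool recv_frame(int fd, std::vector<char>& data) {
    uint32_t nlen = 0;
    if (!recv_all(fd, reinterpret_cast<char*>(&nlen), sizeof(nlen))) return false;
    data.resize(ntohl(nlen));
    return recv_all(fd, data.data(), data.size());
}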
As far as packet size goes, generally an application does not worry about how the bytes are delivered on the network per se, but the application may care about not calling the send and recv system calls too many times. This is particularly important on a concurrent server, where efficiency is key to scalability. So you want a buffer that is big enough to avoid making too many system calls, but not so big as to cause you to block for a long time waiting for the data to drain into the kernel buffer. Matching the send/recv socket buffer size is usually sufficient for that, but it also depends on other factors, such as the bandwidth and latency of the network, how quickly the receiver drains the data, and the timeslice you want to allow per connection being handled concurrently.