libpcap to capture 10 Gbps NIC - c++

I want to capture packets from 10Gbps network card with 0 packet loss.
I am using libpcap with a 100Mbps NIC and it is working fine.
Will libpcap be able to handle 10Gbps NIC traffic?
If not, what are the alternative ways to achieve this?

Whether or not libpcap will handle 10Gbps with 0 packet loss depends on the machine you are using and on the libpcap version. If the machine, CPU and HDD I/O are fast enough, you may get 0 packet loss. Otherwise you may need to perform the following actions:
Update your libpcap to the most recent version. Libpcap 1.0.0 or later supports a zero-copy (memory-mapped) mechanism. It means that there is a buffer in both the kernel's address space and the application's address space, so that data doesn't need to be copied from a kernel-mode buffer to a user-mode buffer. Packets are still copied from the skbuff (Linux) into the shared buffer, so it's really more like "one-copy", but that's still one fewer copy, so it could reduce the CPU time required to receive captured packets. Moreover, more packets can be fetched from the buffer per application wake-up.
If you observe high CPU utilization, it is probably your CPU that cannot handle the packet arrival rate. You can use xosview (a system load visualization tool) to check your system resources during the capture.
If the CPU is dropping packets, you can use PF_RING. PF_RING is an extension of libpcap with a circular buffer: http://www.ntop.org/products/pf_ring/. It is much faster and can capture at 10Gbps with commodity NICs: http://www.ntop.org/products/pf_ring/hardware-packet-filtering/.
Another approach is to get a NIC that has on-board memory and a specific HW design for packet capturing; see http://en.wikipedia.org/wiki/DAG_Technology.
If the CPU is no longer your problem, you need to test the disk data transfer speed. hdparm is the simplest tool on Linux. Some distros come with a GUI; otherwise:
$ sudo hdparm -tT /dev/hda
If you are developing your own application based on libpcap:
Use pcap_stats to identify (a) the number of packets dropped because there was no room in the operating system's buffer when they arrived, because packets weren't being read fast enough, and (b) the number of packets dropped by the network interface or its driver (see the sketch after this list).
a) Libpcap 1.0.0 has an API that lets an application set the buffer size, on platforms where the buffer size can be set.
b) If you find it hard to set the buffer, you can use Libpcap 1.1.0 or later, in which the default capture buffer size has been increased from 32K to 512K.
c) If you are just using tcpdump, use 4.0.0 or later and use the -B flag to set the size of the buffer.
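For reference, a minimal sketch of (a) and the pcap_stats counters using the libpcap 1.0+ API; the interface name "eth0" and the 64 MB buffer size are placeholders, not values from the question:

#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* pcap_create + pcap_activate (libpcap >= 1.0) instead of pcap_open_live,
       so the buffer size can be set before activation. */
    pcap_t *p = pcap_create("eth0", errbuf);        /* "eth0" is a placeholder */
    if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }

    pcap_set_snaplen(p, 65535);
    pcap_set_promisc(p, 1);
    pcap_set_buffer_size(p, 64 * 1024 * 1024);      /* 64 MB capture buffer */
    if (pcap_activate(p) < 0) { pcap_perror(p, "pcap_activate"); return 1; }

    /* ... run the capture loop here (pcap_loop / pcap_dispatch) ... */

    struct pcap_stat st;
    if (pcap_stats(p, &st) == 0)
        printf("received=%u dropped(kernel buffer)=%u dropped(interface/driver)=%u\n",
               st.ps_recv, st.ps_drop, st.ps_ifdrop);

    pcap_close(p);
    return 0;
}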

You don't say which operating system or CPU. It doesn't matter whether you choose libpcap or not; the underlying network performance is still burdened by the operating system's memory management and its network driver. libpcap has kept pace and can handle 10Gbps, but there's more.
If you want the best CPU so that you can do number-crunching and run virtual machines while capturing packets, go with an AMD Opteron CPU, which still outperforms the Intel Xeon quad-core 5540 2.53GHz (despite Intel's XIO/DDIO introduction, and mostly because of Intel's dual-core sharing of the same L2 cache). For the best ready-made OS, go with the latest FreeBSD as-is (which still outperforms Linux 3.10 networking on basic hardware). Otherwise, Intel and Linux will work just fine for basic drop-free 10Gbps capture, provided you are eager to roll up your sleeves.
If you're pushing for breakneck speed all the time while doing financial-like or stochastic or large-matrix predictive computational crunching (or something), then read on...
As RedHat have discovered, 67.2 nanoseconds is all the time there is to process one minimal-sized packet at the 10Gbps rate (a 64-byte frame plus the 8-byte preamble and 12-byte inter-frame gap is 84 bytes, i.e. 672 bits, which take 67.2 ns at 10 Gbps). I assert it's closer to 81.6 nanoseconds for a 64-byte Ethernet payload (102 bytes on the wire once headers, FCS, preamble and gap are included), but they are treating the 46-byte minimum payload as the theoretical case.
To cut it short, you WON'T be able to DO or USE any of the following if you want 0% packet drop at full rate while staying under 81.6 ns per packet:
Make an SKB call for each packet (to minimize that overhead, amortize it over several hundred packets)
TLB misses (Translation Lookaside Buffer; to avoid them, use HUGE page allocations)
Short latency (you did say 'capture', so latency is irrelevant here); the cure is interrupt coalescing
(ethtool -C eth0 rx-frames 1024 or more).
Let processes float across multiple CPUs (you must pin them down, one per network interface interrupt)
libc malloc() (you must replace it with a faster one, preferably a HUGE-page-based one; see the sketch after this list)
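To illustrate the last three points, here is a minimal Linux sketch of pinning a process to one core and allocating a huge-page-backed buffer; the core number (2) and the 16 MB size are assumptions, and huge pages must have been reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Pin this process to core 2 (assumed to be the core servicing the NIC interrupt). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Allocate a 16 MB packet buffer backed by 2 MB huge pages to cut down TLB misses. */
    size_t len = 16UL * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    /* ... hand buf to the capture layer and batch packets into it ... */

    munmap(buf, len);
    return 0;
}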
So, Linux has an edge over FreeBSD for capturing at the 10Gbps rate with a 0% drop rate AND running several virtual machines (and other overheads). It just requires a new memory management (MM) of some sort for a specific network device, not necessarily for the whole operating system. Most new super-high-performance network drivers now have the device use HUGE pages that were allocated in userland, then use driver calls to pass a bundle of packets at a time.
Many new network drivers with such repurposed MM are out (in no particular order):
netmap
PF-RING
PF-RING+netmap
OpenOnload
DPDK
PacketShader
The maturity level of each codebase is highly dependent on which Linux kernel (or distro) version you choose. I've tried a few of them, and once I understood the basic design, it became apparent what I needed. YMMV.
Updated: White paper on high speed packet architecture: https://arxiv.org/pdf/1901.10664.pdf
Good luck.

PF_RING is a good solution; an alternative can be netsniff-ng (http://netsniff-ng.org/).
For both projects the performance gain is achieved by zero-copy mechanisms. Obviously, the bottleneck can be the HD and its data transfer rate.

If you have the time, then move to Intel DPDK. It allows zero-copy access to the NIC's hardware registers. I was able to achieve 0% drops at 10Gbps, 1.5Mpps on a single core.
You'll be better off in the long run.
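For a flavour of what the DPDK receive path looks like, here is a rough sketch of just the polling loop; EAL initialization and port/queue setup (rte_eal_init, rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start) are omitted, and port 0 with a burst size of 32 are assumptions:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC queue directly from userspace: no interrupts, no syscalls. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... process the packet data in place ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}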

Related

sendmsg + raw ethernet + several frames

I use Linux 3.x and a modern glibc (2.19).
I would like to send several Ethernet frames without switching between kernel and user space back and forth.
I have MTU = 1500, and I want to send 800 KB.
I initialize the receiver address like this:
struct sockaddr_ll socket_address;
memset(&socket_address, 0, sizeof(socket_address)); /* zero the unused fields */
socket_address.sll_family = AF_PACKET;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...
After that I can call sendto/sendmsg 800KB / 1500 ~= 500 times and everything works fine, but this requires user space <-> kernel transitions ~ 500 * 25 times per second. I want to avoid that.
I tried to init struct msghdr::msg_iov with the appropriate info,
but I get the error "message too long"; it looks like msghdr::msg_iov cannot describe something with size > MTU.
So the question is: is it possible to send many raw Ethernet frames on Linux from userspace at once?
PS
I get the data (800KB) from a file and read it into memory. So struct iovec is good for me: I can create a suitable number of Ethernet headers and use two iovecs per 1500-byte packet, one pointing to the data and one pointing to the Ethernet header.
Whoa.
My last company made realtime hi-def video encoding hardware. In the lab, we had to blast 200MB/second across a bonded link, so I have some experience with this. What follows is based upon that.
Before you can tune, you must measure. You don't want to do multiple syscalls, but can you prove with timing measurement that the overhead is significant?
I use a wrapper routine around clock_gettime that gives back the time of day with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".
So, for any given syscall, you need a measurement:
tstart = nanotime();
syscall();
tdif = nanotime() - tstart;
For send/sendto/sendmsg/write, do this with small data so you're sure you're not blocking [or use O_NONBLOCK, if applicable]. This gives you the syscall overhead.
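A minimal sketch of such a wrapper and one timing sample (CLOCK_MONOTONIC is used here rather than time of day; substitute the send/sendto/sendmsg call you actually care about):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Nanosecond timestamp wrapper around clock_gettime. */
static long long nanotime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    char byte = 0;
    long long tstart = nanotime();
    write(STDOUT_FILENO, &byte, 0);      /* a cheap, non-blocking stand-in syscall */
    long long tdif = nanotime() - tstart;
    fprintf(stderr, "syscall overhead: %lld ns\n", tdif);
    return 0;
}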
Why are you going directly to ethernet frames? TCP [or UDP] is usually fast enough and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires ethernet frames, or was it that you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800KB/s (~1MB/s) and my project was doing 100x-200x more than that over TCP.
What about using two plain write calls to the socket? One for header, one for data [all 800KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.
Further, why do you need your header to be in a separate buffer? When you allocate your buffer, just do:
size_t datamax = 800 * 1024;                 // or whatever
size_t buflen = sizeof(struct header) + datamax;
char *buf = malloc(buflen);
while (1) {
    ssize_t datalen = read(fdfile, buf + sizeof(struct header), datamax);
    if (datalen <= 0)
        break;                               // EOF or read error
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}
This works even for the ethernet frame case.
One of the things you can also do is use setsockopt to increase the size of the kernel buffer for your socket. Otherwise, you can send data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
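A sketch of that setsockopt call on an already-created socket; the 4 MB value is an assumption, and the kernel may clamp it to net.core.wmem_max / net.core.rmem_max:

#include <stdio.h>
#include <sys/socket.h>

/* Enlarge the kernel send/receive buffers for socket fd. */
static int grow_socket_buffers(int fd)
{
    int size = 4 * 1024 * 1024;          /* 4 MB -- tune to your data rate */

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) != 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) != 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}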
To measure the performance of the wire, add some fields to your header:
u64 send_departure_time; // set by sender from nanotime
u64 recv_arrival_time; // set by receiver when the packet arrives
So, the sender sets the departure time and does the write [just send the header for this test]. Call this packet Xs. The receiver stamps it when it arrives. The receiver immediately sends back a message to the sender [call it Xr] with a departure stamp of its own and the contents of Xs. When the sender gets this, it stamps it with an arrival time.
With the above we now have:
T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender
Assuming you do this on a relatively quiet connection with little to no other traffic, and you know the link speed (e.g. 1 Gb/s), with T1/T2/T3/T4 you can calculate the overhead: the round trip minus the receiver's turnaround is (T4 - T1) - (T3 - T2); subtract the known wire time for the frame sizes involved and what remains is the per-packet software overhead.
You can repeat the measurement for TCP/UDP vs ETH. You may find that it doesn't buy you as much as you think. Once again, can you prove it with precise measurement?
I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100Gb Ethernet NIC card and the NIC does the timestamping in hardware.
One of the other things you may have to do is add some throttle control. This is similar to what bittorrent does or what the PCIe bus does.
When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind writes". That is, the sender is free to blast up to this much without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. The sender can add this value back to the blind write limit and keep going.
For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.
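A sketch of the sender-side credit accounting described above; the message framing and the ACK format are made up for illustration:

#include <stddef.h>

/* Bytes we may still "blind write"; start it at the receiver's advertised
   kernel socket buffer size, learned at connection setup. */
static size_t credit;

static int  can_send(size_t len)  { return len <= credit; }
static void on_sent(size_t len)   { credit -= len; }   /* consumed part of the window */
static void on_ack(size_t freed)  { credit += freed; } /* receiver drained this many bytes */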
UPDATE
Based upon some of the additional information from your comments [the actual system configuration should go, in a more complete form, as an edit at the bottom of your question], here are some further thoughts.
You do have a need for a raw socket and sending an ethernet frame. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl (see the sketch below). You may not need to set the MTU to 800KB; your CPU's NIC card has a practical limit. You can probably increase the MTU from 1500 to 15000 easily enough. This would reduce syscall overhead by 10x, and that may be "good enough".
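A sketch of the ioctl approach; the interface name "eth0" and the MTU of 9000 are placeholders, and the call needs CAP_NET_ADMIN plus a NIC/driver that accepts the value:

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);       /* any socket works for interface ioctls */
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */
    ifr.ifr_mtu = 9000;                            /* placeholder jumbo-frame MTU */

    if (ioctl(fd, SIOCSIFMTU, &ifr) != 0)
        perror("ioctl(SIOCSIFMTU)");

    close(fd);
    return 0;
}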
You probably will have to use sendto/sendmsg. The two write calls were based on conversion to TCP/UDP. But, I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most example code for what you want uses sendto. sendmsg seems like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
In addition to improving syscall overhead, a larger MTU might improve the efficiency of the "wire", even though this doesn't seem like a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I am still puzzled by one of your comments about "not using a wire". An FPGA connected to the ethernet pins of the CPU I get, sort of. More precisely, do you mean FPGA pins connected to the ethernet pins of the NIC card/chip of the CPU?
Are the CPU/NIC on the same PC board and the FPGA pins are connected via PC board traces? Otherwise, I don't understand "not using a wire".
However, once again, I must say that you must be able to measure your performance before you blindly try to improve it.
Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize for it may not be worth it and doing so may actually hurt performance more severely in other areas that you didn't realize when you started.
As an example, I once worked on a system that had a severe performance problem, such that, the system didn't work. I suspected the serial port driver was slow, so I recoded from a high level language (e.g. like C) into assembler.
I increased the driver performance by 2x, but it contributed less than a 5% performance improvement to the system. It turned out the real problem was that other code was accessing non-existent memory which just caused a bus timeout, slowing the system down measurably [it did not generate an interrupt that would have made it easy to find as on modern systems].
That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!
Nowadays, I never try large optimization until I can measure first. In some cases, I add an optimization that I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out that the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.
What CPU are you using: x86, arm, mips, etc.? At what clock frequency? How much DRAM? How many cores?
What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted entirely to logic, or do you also have a CPU inside it such as a MicroBlaze, Nios, or ARM? Does the FPGA have access to DRAM of its own [and how much DRAM]?
If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint? If you increase the MTU, you may need to add an ack/sync protocol as I suggested in the original post.
Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between CPU and FPGA.
This may be mitigated, purely as a side effect, by sending small packets. If you increase the MTU too much, you might overwhelm the FPGA. In other words, it is the very overhead you're trying to optimize away that allows the FPGA to keep up with the data rate.
This is what I meant by unintended consequences of blind optimization. It can have unintended and worse side effects.
What is the nature of the data being sent to the FPGA? You're sending 800KB, but how often?
I am assuming that it is not the FPGA firmware itself for a few reasons. You said the firmware was already almost full [and it is receiving the ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. So, am I correct?
You're sending the data to the FPGA from a file. This implies that it is only being sent once, at the startup of your CPU's application. Is that correct? If so, optimization is not needed because it's an init/startup cost that has little impact on the running system.
So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall. Not just from syscall overhead, but optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy/transfer is 64KB, depending upon the filesystem or underlying disk characteristics.
So, if you're looking to reduce overhead, reading the data from a file may involve considerably more of it than having the application generate the data [if that's possible].
The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.
You say your system is using a lot of CPU time for other things. Can you measure the other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [quite probably can find] the larger bottlenecks and recode those, and you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.
Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.
When considering performance, it is important to consider it from an overall system standpoint. In your case, this means CPU, FPGA, and anything else in it.
You say that the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they're not there that the FPGA is almost out of space, but otherwise you would? Is the FPGA firmware 100% done, or is there more RTL to be written? If you're at 90% space utilization in the FPGA and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, possibly with a higher clock rate.
In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We also used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendors development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.
So, when an FPGA is almost full with logic, the place-and-route can be difficult to achieve. That might be a reason to consider a larger part [if possible].

Linux Timing across Kernel & User Space

I'm writing a kernel module for a special camera, working through V4L2 to handle the transfer of frames to userspace code. Then I do lots of userspace stuff in the app.
Timing is very critical here, so I've been doing lots of performance profiling and plain old std::chrono::steady_clock stuff to track timing, but I've reached the point where I need to also collect timing data from the kernel side of things so that I can analyze the entire path from hardware interrupt through V4L2 DQBUF to userspace.
Can anyone recommend a good way to get high-resolution timing data that would be consistent with the application's userspace data, so I could use it for such comparisons? Right now I'm measuring activity in microseconds.
Ubuntu 12.04 LTS
At the lowest level, there are the rdtsc and rdtscp instructions if you're on an x86/x86-64 processor. They should provide the lowest overhead and the highest possible resolution across the kernel/userspace boundary.
However, there are things you need to worry about: you need to make sure you're executing on the same core/CPU, that the process isn't being context switched, and that the frequency isn't changing across invocations. If the CPU supports an invariant TSC (constant_tsc in /proc/cpuinfo), it's a little more reliable across CPUs/cores and frequencies.
This should provide roughly nanosecond accuracy.
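A minimal userspace sketch using the GCC/Clang intrinsics on x86-64 (converting ticks to nanoseconds additionally requires the invariant TSC frequency, which is not shown here):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned int aux;

    /* __rdtscp also returns IA32_TSC_AUX, in which Linux encodes the CPU number,
       so a migration between the two readings can be detected. */
    unsigned long long t0 = __rdtscp(&aux);
    unsigned int cpu0 = aux;

    /* ... code under measurement ... */

    unsigned long long t1 = __rdtscp(&aux);
    if (aux != cpu0)
        fprintf(stderr, "warning: migrated between CPUs, sample unreliable\n");

    printf("elapsed: %llu TSC ticks\n", t1 - t0);
    return 0;
}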
There are a lot of kernel-level utilities available that can get timing-related traces for you, e.g. ptrace, ftrace, LTTng, kprobes. Check out this link for more information:
http://elinux.org/Kernel_Trace_Systems

C/C++ technologies involved in sending data across networks very fast

In terms of low latency (I am thinking about financial exchanges/co-location - people who care about microseconds), what options are there for sending packets between two Unix computers from a C++ program?
I have heard about kernel-bypass network cards, but does this mean you program against some sort of API for the card? I presume this would be a faster option in comparison to using standard Unix Berkeley sockets?
I would really appreciate any contribution, especially from people who are involved in this area.
EDITED: from milliseconds to microseconds.
EDITED: I am kinda hoping to receive answers based more upon C/C++ rather than network hardware technologies. It was intended as a software question.
UDP sockets are fast, low latency, and reliable enough when both machines are on the same LAN.
TCP is much slower than UDP but when the two machines are not on the same LAN, UDP is not reliable.
Software profiling will stomp out obvious problems with your program. However, when you are talking about network performance, network latency is likely to be your largest bottleneck. If you are using TCP, then you want to do things that avoid congestion and loss on your network to prevent retransmissions. There are a few things to do to cope:
Use a network with bandwidth and reliability guarantees.
Properly size your TCP parameters to maximize utilization without incurring loss.
Use error correction in your data transmission to correct for the small amount of loss you might encounter.
Or you can avoid using TCP altogether. But if reliability is required, you will end up implementing much of what is already in TCP.
But, you can leverage existing projects that have already thought through a lot of these issues. The UDT project is one I am aware of, that seems to be gaining traction.
At some point in the past, I worked with a packet-sending driver that was loaded into the Windows kernel. Using this driver it was possible to generate a stream of packets something like 10-15 times higher (I do not remember the exact number) than from an app that was using the sockets layer.
The advantage is simple: the sending request comes directly from the kernel and bypasses multiple layers of software: sockets, the protocol stack (even for a UDP packet, simple protocol driver processing is still needed), context switches, etc.
Usually reduced latency comes at the cost of reduced robustness. Compare, for example, the (often greatly advertised) fastpath option for ADSL: the reduced latency due to shorter packet transfer times comes at the cost of increased error susceptibility. Similar technologies might exist for a large number of network media. So it very much depends on the hardware technologies involved. Your question suggests you're referring to Ethernet, but it is unclear whether the link is Ethernet-only or something else (ATM, ADSL, …), and whether some other network technology would be an option as well. It also very much depends on geographical distances.
EDIT:
I got a bit carried away with the hardware aspects of this question. To provide at least one aspect tangible at the level of application design: have a look at zero-copy network operations like sendfile(2). They can be used to eliminate one possible cause of latency, although only in cases where the original data came from some source other than the application memory.
As my day job, I work for a certain stock exchange. The answer below is my own opinion, based on the software solutions we provide exactly for this kind of high-throughput, low-latency data transfer. It is not intended in any way to be taken as a marketing pitch (please, I am a dev). This is just to outline the essential components of the software stack in a solution for this kind of fast data (the data could be stock/trading market data or, in general, any data):
1] Physical layer - a network interface card in the case of a TCP-UDP/IP based Ethernet network, or a very fast / high-bandwidth interface called an Infiniband Host Channel Adapter. In the case of the IP/Ethernet software stack, it is part of the OS. For Infiniband, the card manufacturers (Intel, Mellanox) provide their drivers, firmware and an API library against which one has to implement the socket code (even Infiniband uses its own 'socketish' protocol for network communication between 2 nodes).
2] The next layer above the physical layer is a middleware which basically abstracts the lower network protocol nitty-gritties and provides some kind of interface for data I/O from the physical layer to the application layer. This layer also provides some kind of network data quality assurance (if using TCP).
3] The last layer would be the application which we provide on top of the middleware. Anyone who gets 1] and 2] from us can develop a low-latency/high-throughput 'data transfer over the network' kind of app for stock trading or algorithmic trading kinds of applications, using a choice of programming language interfaces - C, C++, Java, C#.
Basically, a client like you can develop their own application in C or C++ using the APIs we provide, which will take care of interacting with the NIC or HCA (i.e. the actual physical network interface) to send and receive data fast, really fast.
We have a comprehensive solution catering to the different quality and latency profiles demanded by our clients: some are fine with microsecond latency but need high data quality/very few errors; some can tolerate a few errors but need nanosecond latency; some need microsecond latency with no errors tolerable; ...
If you need or are interested in any way in this kind of solution, ping me offline at my contacts mentioned here at SO.

Unexpected Socket CPU Utilization

I'm having a performance issue that I don't understand. The system I'm working on has two threads that look something like this:
Version A:
Thread 1: Data Processing -> Data Selection -> Data Formatting -> FIFO
Thread 2: FIFO -> Socket
Where 'Selection' thins down the data and the FIFO at the end of thread 1 is the FIFO at the beginning of thread 2 (the FIFOs are actually TBB Concurrent Queues). For performance reasons, I've altered the threads to look like this:
Version B:
Thread 1: Data Processing -> Data Selection -> FIFO
Thread 2: FIFO -> Data Formatting -> Socket
Initially, this optimization proved to be successful. Thread 1 is capable of much higher throughput. I didn't look too hard at Thread 2's performance because I expected the CPU usage would be higher and (due to data thinning) it wasn't a major concern. However, one of my colleagues asked for a performance comparison of version A and version B. To test the setup I had thread 2's socket (a boost asio tcp socket) write to an instance of iperf on the same box (127.0.0.1) with the goal of showing the maximum throughput.
To compare the two setups I first tried forcing the system to write data out of the socket at 500 Mbps. As part of the performance testing I monitored top. What I saw surprised me. Version A did not show up on 'top -H', nor did iperf (this was actually as suspected). However, version B (my 'enhanced' version) was showing up on 'top -H' with ~10% CPU utilization and (oddly) iperf was showing up with 8%.
Obviously, that implied to me that I was doing something wrong. I can't seem to prove that I am though! Things I've confirmed:
Both versions are giving the socket 32k chunks of data
Both versions are using the same boost library (1.45)
Both have the same optimization setting (-O3)
Both receive the exact same data, write out the same data, and write it at the same rate.
Both use the same blocking write call.
I'm testing from the same box with the exact same setup (Red Hat)
The 'formatting' part of thread 2 is not the issue (I removed it and reproduced the problem)
Small packets across the network is not the issue (I'm using TCP_CORK and I've confirmed via wireshark that the TCP Segments are all ~16k).
Putting a 1 ms sleep right after the socket write makes the CPU usage on both the socket thread and iperf(?!) go back to 0%.
Poor man's profiler reveals very little (the socket thread is almost always sleeping).
Callgrind reveals very little (the socket write barely even registers)
Switching iperf for netcat (writing to /dev/null) doesn't change anything (actually netcat's cpu usage was ~20%).
The only thing I can think of is that I've introduced a tighter loop around the socket write. However, at 500 Mbps I wouldn't expect that the cpu usage on both my process and iperf would be increased?
I'm at a loss to why this is happening. My coworkers and I are basically out of ideas. Any thoughts or suggestions? I'll happily try anything at this point.
This is going to be very hard to analyze without code snippets or actual data quantities.
One thing that comes to mind: if the pre-formatted data stream is significantly larger than post-format, you may be expending more bandwidth/cycles copying a bunch more data through the FIFO (socket) boundary.
Try estimating or measuring the data rate at each stage. If the data rate is higher at the output of 'selection', consider the effects of moving formatting to the other side of the boundary. Is it possible that no copy is required for the select->format transition in configuration A, and configuration B imposes lots of copies?
... just guesses without more insight into the system.
What if the FIFO was the bottleneck in version A? Then both threads would sit and wait for the FIFO most of the time, and in version B you'd be handing the data off to iperf faster.
What exactly do you store in the FIFO queues? Do you store packets of data i.e buffers?
In version A, you were writing formatted data (probably bytes) to the queue. So, sending it on the socket involved just writing out a fixed size buffer.
However, in version B you are storing higher-level data in the queues. Formatting it now creates bigger buffers that are written directly to the socket. This causes the TCP/IP stack to spend CPU cycles on fragmenting and other overhead...
This is my theory based on what you have said so far.

How to use DSP to speed-up a code on OMAP?

I'm working on a video codec for the OMAP3430. I already have code written in C++, and I am trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, has an additional DSP).
I tried to port a small for loop which runs over a very small amount of data (~250 bytes), but about 2M times on different data. But the overhead from the communication between the CPU and DSP is much more than the gain (if I have any gain at all).
I assume this task is much like optimizing code for GPUs in normal computers. My question is: what kinds of parts would be beneficial to port? How do GPU programmers take care of such tasks?
edit:
GPP application allocates a buffer of size 0x1000 bytes.
GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned along a 4K page boundary.
GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space which has been mapped to a buffer allocated on the GPP. The GPP application uses DSPNode_PutMessage to send the message to the DSP.
GPP invokes memcpy to copy the data to be processed into the shared memory.
GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access the buffer. The message also contains the amount of data written to the buffer so that the DSP will know just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait to hear a message back from the DSP.
After these steps the execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes processing. Just as a test, I don't put any processing inside the DSP program; I just send a "processing finished" message back to the GPP. And this still consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely because of the communication overhead?
The OMAP3430 does not have an on-board DSP; it has an IVA2+ video/audio decode engine hooked to the system bus, and the Cortex core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX based unit. While it does have programmable shaders, I don't believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I've never heard of such support.
If you're using the IVA2+ encode/decode engine that is on board, you need to use the proper libraries for this unit, and from what I know it only supports specific codecs. Are you trying to write your own library for this module?
If you're using the Cortex's built-in DSP-ish SIMD instructions, post some code.
If your dev board has some extra DSP on it, what is the DSP and how is it connected to the OMAP?
As to the desktop GPU question: in the case of video decode you use the vendor-supplied function libraries to make calls to the hardware. There are several: VDPAU for Nvidia on Linux, and similar libraries on Windows (PureVideo HD, I think it's called). ATI also has both Linux and Windows libraries for their on-board decode engines; I don't know the names.
I don't know what time frame you're transferring data in, but I know the TMS320C64x which is listed on the spec sheet for the SDK has a very powerful DMA engine. (I'm assuming it's the original ZOOM OMAP34X MDK; it says it has a 64xx.) I would hope the OMAP has something similar; use them to their fullest advantage. I would recommend setting up "ping-pong" buffers in the internal RAM of the 64xx and using the SDRAM as shared memory with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts have the ability to bus 8 32-bit words to the processor core once the data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for the parts are available from TI for free download in PDF. They even gave me hardcopies of the TMS320C6000 CPU and Instruction Set manual and many other books for free.
As far as programming is concerned, you may need to use some of the "processor intrinsics" or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations when possible, because it doesn't have a built-in floating point core. (Those are in the 67xx series.) If you look at the execution units and can map your calculations such that the different parts target different operations in a manner that can occur in a single cycle, then you will be able to achieve the best performance out of those parts. The instruction set manual lists the types of ops performed by each execution unit. If you can break your calculation up into dual data-flow sets and unroll the loops a bit, the compiler will be "nicer" to you when full optimization is on. This is due to the fact that the processor is broken up into a left and a right side with nearly identical execution units on either side.
Hope this helps.
From the measurements I did, one messaging cycle between the CPU and DSP takes about 160us. I don't know whether this is because of the kernel I use or the bridge driver, but this is a very long time for simple back-and-forth messaging.
It seems that it is only reasonable to port an algorithm to DSP if the total computational load is comparable to the time required for messaging; and if the algorithm is suitable for simultaneous computing on CPU and DSP.