CRC/Parity/Hamming Protect 16-bit parallel bus - c++

I've got a Cortex-M4 based MCU linked to an FPGA via a 16-bit parallel memory bus interface. In essence the FPGA behaves like an external memory mapped into the memory space of the MCU: the MCU presents an address and then either writes a data word (write) or reads the word presented by the FPGA (read).
I want to protect both reads and writes against transmission errors, during both the addressing and the data phase. However, I don't expect many bit errors, since the distance between the two parts is short.
I can easily implement checking and generation of parity, Hamming codes or a CRC inside the FPGA. However, doing the same (checking and generating) in the µC seems comparatively harder, since I don't want to cripple the throughput. Without error detection, reading or writing a 16-bit word takes around 4-6 processor cycles and is thus rather fast. Consequently I don't want to spend hundreds of cycles on protective measures.
In the end I am looking for a moderately efficient error detection method for 16-bit data that can be implemented on a µC in as few cycles as possible.

It's (in my experience) quite rare to protect a parallel bus like this. It's of course done in PC and server class hardware with ECC RAM and so on, but rarely in microcontrollers.
If your particular Cortex-M4 implementation has a hardware CRC block, you might be able to stream the data there, assuming you can simply add a word of CRC to the end of each bus transfer. That would probably still slow it down by at least a factor of 2-3 though, since each word coming to/from the FPGA must also be fed in software to the CRC unit.
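For reference, here is a rough sketch of that idea in C++. It is not a drop-in driver: CRC_DR and FPGA_BASE are placeholder addresses standing in for a hypothetical STM32-style write-to-accumulate CRC data register and for the FPGA's memory window, so adapt it to your actual part:

#include <cstddef>
#include <cstdint>

// Placeholder addresses for a hypothetical memory-mapped CRC unit and FPGA window.
static volatile std::uint32_t *const CRC_DR =
    reinterpret_cast<volatile std::uint32_t *>(0x40023000);
static volatile std::uint16_t *const FPGA_BASE =
    reinterpret_cast<volatile std::uint16_t *>(0x60000000);

void fpga_write_block(const std::uint16_t *src, std::size_t nwords)
{
    for (std::size_t i = 0; i < nwords; ++i) {
        FPGA_BASE[i] = src[i];   // the normal 4-6 cycle bus write
        *CRC_DR = src[i];        // one extra store to feed the hardware CRC unit
    }
    // One extra bus transfer carries the checksum; the FPGA recomputes its own
    // CRC over the words it actually latched and flags a mismatch.
    FPGA_BASE[nwords] = static_cast<std::uint16_t>(*CRC_DR);
}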


Can I use SIMD for speeding up string manipulation?

Are SIMD instructions built for vector numerical calculations only? Or do they lend themselves well to a class of string manipulation tasks, like writing rows of data to a text file where the order of the rows does not matter? If so, which APIs or libraries should I start with?
Yes! And this is actually done in high-performance parsing libraries. One example: simdjson, a parser that can parse gigabytes of JSON per second. There's an About simdjson section in the readme, which has a link to a talk that goes over some of the implementation details.
SIMD instructions operate on numeric values, but once you're at that level, "text" is just numeric values, e.g. UTF-8 code units are just unsigned 8-bit integers, with plenty of SIMD support. Processing bitmaps is full of operations on multiple 8-bit unsigned integers in parallel, and it conveniently happens that this is so common that SIMD instruction sets cover these operations, so plenty of them are also usable for text processing.
I/O is so many orders of magnitude slower than the CPU
Not really. It is slower, but when the CPU has to do tasks that kill the streaming performance, such as branch mispredictions, cache misses, or wasting lots of speculative execution resources on dead-ends, the CPU can very easily not keep up with I/O. Modern network cards used for fast storage access or multi-machine communications can saturate the CPU's memory ports. All of them. And keep them that way. But that's state of the art and quite expensive at the moment (bonded 50 GBit links and such). Sequential, byte-at-a-time parser code is way slower than that.
Yes, especially for ASCII e.g. Convert a String In C++ To Upper Case. Or checking for valid UTF-8 (https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/), or checking if a string happens to be the ASCII subset of UTF-8. (If so, you know you have fixed-width characters which is very useful for other things.)
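For the ASCII-uppercase case just mentioned, a minimal sketch of the SIMD approach with SSE2 intrinsics might look like this (16 bytes per iteration; bytes outside 'a'..'z', including non-ASCII bytes, pass through unchanged):

#include <emmintrin.h>
#include <cstddef>

void ascii_toupper(char *s, std::size_t n)
{
    const __m128i lo = _mm_set1_epi8('a' - 1);
    const __m128i hi = _mm_set1_epi8('z' + 1);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s + i));
        __m128i gt = _mm_cmpgt_epi8(v, lo);            // v > 'a'-1 (signed compare is fine for ASCII)
        __m128i lt = _mm_cmpgt_epi8(hi, v);            // v < 'z'+1
        __m128i m  = _mm_and_si128(gt, lt);            // 0xFF where the byte is a lowercase letter
        v = _mm_xor_si128(v, _mm_and_si128(m, _mm_set1_epi8(0x20)));  // flip the 0x20 bit only there
        _mm_storeu_si128(reinterpret_cast<__m128i *>(s + i), v);
    }
    for (; i < n; ++i)                                 // scalar tail
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] = static_cast<char>(s[i] - 0x20);
}

The hand-written asm in glibc and friends is fancier (page-boundary handling, runtime CPU dispatch), but the core trick is the same: compare 16 bytes at once and apply the case change only where the compare mask is set.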
As Daniel Lemire reported, an early attempt at UTF-8 validation gave "a few CPU cycles per character." But with SIMD, he and collaborators were able to achieve ~1 instruction per byte, for net speeds of ~12GB/s. (vs. DRAM bandwidth of a Haswell desktop being ~25GB/s, or Skylake at 34GB/s with DDR4-2133).
Of course, most C libraries already have hand-written asm implementations of functions like strlen, strcpy, strcasecmp, strstr, etc. that use SIMD if it's a win (like on x86-64 where pmovmskb allows relatively efficient compare/branch on any/all SIMD compare results being true or false.) The first part of my answer on Why does glibc's strlen need to be so complicated to run quickly? has some links to hand-optimized asm that glibc actually uses on mainstream platforms, instead of the portable plain C fallback the question is asking about.
https://github.com/WojciechMula/sse4-strstr has a variety of strstr implementations. Substring searching is a much harder problem, with non-trivial algorithm choices as well as just brute-force. The SSE4.2 "string" instructions may help for that, but if not then SIMD vector compares definitely can for better brute-force building blocks.
(SSE4.2 "string" instructions like pcmpistri are definitely worse for memcmp / strcmp and strlen where plain SSE2 (or AVX2) is better. See How much faster are SSE4.2 string instructions than SSE2 for memcmp? and https://www.strchr.com/strcmp_and_strlen_using_sse_4.2)
You can even do cool tricks with looking up a shuffle control vector based on a vector compare bitmap, e.g. Fastest way to get IPv4 address from string or How to implement atoi using SIMD?.
Although I'm not sure the SIMD atoi is a win vs. scalar, especially for short numbers.
Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case?
DRAM bandwidth is really pretty good compared to modern CPU speeds, especially when the data comes in byte chunks, not 8-byte double chunks. And data is often hot in L3 cache after copying (e.g. from a read system call).
Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway, especially if this core isn't competing with other bandwidth-intensive code on other cores. Good luck keeping up with that with byte-at-a-time scalar loops.
Besides, if you just did a read() system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache, or something like that. Aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not fully optimizing a function if it actually gets used a lot.
Fewer instructions to process the same data lets out-of-order exec "see" farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries). And also better overlap the string processing with earlier / later independent work.
It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save energy: tracking instructions through the pipeline is a large part of the cost, not the integer execution units themselves. Higher power while running, but finishing sooner, is good: race to sleep. CPUs save much more power when fully idle than when just running "cheap" instructions.
SIMD instructions are used on a very low level. Writing data to a text file is a much higher level, involving buffered I/O etc.
You might use SIMD, e.g., to convert a string from lower case to upper case. Wrapping SIMD into a library would be moot. You write the instructions yourself. Which also means that they are processor-specific (e.g. SSE variants on x86/AMD64).
For processing several rows of text in parallel, you might use micro-parallelization instead, e.g. offered by OpenMP or TBB.
However, if you stick to the example of writing to the text file, we get into another territory of performance optimizations (I/O instead of computation).
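As a minimal sketch of the OpenMP suggestion (my example, compile with -fopenmp): transform the rows independently in parallel, then write them out in one buffered pass, since the question says row order does not matter:

#include <algorithm>
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

void upcase_and_write(std::vector<std::string> &rows)
{
    #pragma omp parallel for                       // each row is independent work
    for (long i = 0; i < static_cast<long>(rows.size()); ++i)
        std::transform(rows[i].begin(), rows[i].end(), rows[i].begin(),
                       [](unsigned char c) { return std::toupper(c); });

    for (const std::string &r : rows)              // keep the I/O single-threaded and buffered
        std::fwrite(r.data(), 1, r.size(), stdout);
}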

Can modern x86 hardware not store a single byte to memory?

Speaking of the memory model of C++ for concurrency, Stroustrup's C++ Programming Language, 4th ed., sect. 41.2.1, says:
... (like most modern hardware) the machine could not load or store anything smaller than a word.
However, my x86 processor, a few years old, can and does store objects smaller than a word. For example:
#include <iostream>
int main()
{
    char a = 5;
    char b = 25;
    a = b;
    std::cout << int(a) << "\n";
    return 0;
}
Without optimization, GCC compiles this as:
[...]
movb $5, -1(%rbp) # a = 5, one byte
movb $25, -2(%rbp) # b = 25, one byte
movzbl -2(%rbp), %eax # load b, one byte, not extending the sign
movb %al, -1(%rbp) # a = b, one byte
[...]
The comments are by me but the assembly is by GCC. It runs fine, of course.
Obviously, I do not understand what Stroustrup is talking about when he explains that hardware can load and store nothing smaller than a word. As far as I can tell, my program does nothing but load and store objects smaller than a word.
The thoroughgoing focus of C++ on zero-cost, hardware-friendly abstractions sets C++ apart from other programming languages that are easier to master. Therefore, if Stroustrup has an interesting mental model of signals on a bus, or has something else of this kind, then I would like to understand Stroustrup's model.
What is Stroustrup talking about, please?
LONGER QUOTE WITH CONTEXT
Here is Stroustrup's quote in fuller context:
Consider what might happen if a linker allocated [variables of char type like] c and b in the same word in memory and (like most modern hardware) the machine could not load or store anything smaller than a word.... Without a well-defined and reasonable memory model, thread 1 might read the word containing b and c, change c, and write the word back into memory. At the same time, thread 2 could do the same with b. Then, whichever thread managed to read the word first and whichever thread managed to write its result back into memory last would determine the result....
ADDITIONAL REMARKS
I do not believe that Stroustrup is talking about cache lines. Even if he were, as far as I know, cache coherency protocols would transparently handle that problem except maybe during hardware I/O.
I have checked my processor's hardware datasheet. Electrically, my processor (an Intel Ivy Bridge) seems to address DDR3L memory by some sort of 16-bit multiplexing scheme, so I don't know what that's about. It is not clear to me that that has much to do with Stroustrup's point, though.
Stroustrup is a smart man and an eminent scientist, so I do not doubt that he is taking about something sensible. I am confused.
See also this question. My question resembles the linked question in several ways, and the answers to the linked question are also helpful here. However, my question goes also to the hardware/bus model that motivates C++ to be the way it is and that causes Stroustrup to write what he writes. I do not seek an answer merely regarding that which the C++ standard formally guarantees, but also wish to understand why the C++ standard would guarantee it. What is the underlying thought? This is part of my question, too.
TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)
The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)
About Stroustrup's phrasing
I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)
It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)
Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.
I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.
Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.
If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids the possibility of store coalescing in the store buffer before commit to L1d cache hiding the cost, because the store execution unit(s) can only run 1 store per clock on current CPUs.)
However, some high performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache. Are there any modern CPUs where a cached byte store is actually slower than a word store? The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)
(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)
Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (@Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.
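As a quick sanity check of the quoted 32-bit figure (my arithmetic, not from the Alpha paper), in LaTeX:

% Check bits r for a single-error-correcting Hamming code over m data bits,
% plus one extra overall parity bit for double-error detection (SECDED):
2^{r} \ge m + r + 1
\quad\Longrightarrow\quad
m = 32:\ r = 6 \ (2^{6} = 64 \ge 39),\qquad
\text{SECDED bits} = r + 1 = 7 \approx 22\%\ \text{of } 32.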
Current Intel CPUs only use parity (not ECC) in L1D for this reason. (At least some older Xeons could run with L1d in ECC mode at half capacity instead of the normal 32KiB, as discussed on RWT. It's not clear if anything's changed, e.g. in terms of Intel now using ECC for L1d). See also this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.
It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.
Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).
As @Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.
How many and what size cycles will be needed to perform longword transferred to the CPU shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But nevermind that, the important point is that it has external signaling for the transfer size, to tell the memory HW which byte it's actually writing.
Stroustrup's next paragraph is
"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."
So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.
All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.
However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)
Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)
A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See @old_timer's comment making the same point (but also for RMW in a memory controller).
This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.
On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.
Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).
The 4th edition of Stroustrup's book was published in 2013, when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW), but "modern" CPUs at that time were byte-addressable with byte stores. The Cyber CDC 6600 was word-addressable and probably still around, but couldn't be called modern.
Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can still be committed internally with a word-sized RMW in the cache, as discussed above.)
I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.
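In C++ terms, that software fallback could look something like the following sketch: a CAS loop on the containing 32-bit word (on Alpha the compare_exchange would itself be an LL/SC loop). This is my illustration, not code from any real compiler's runtime:

#include <atomic>
#include <cstdint>

// Emulate a byte store on a machine that only has word-sized atomics.
void store_byte(std::atomic<std::uint32_t> &word, unsigned byte_index, std::uint8_t value)
{
    const int shift = 8 * byte_index;                       // little-endian byte position
    const std::uint32_t mask = std::uint32_t(0xFF) << shift;
    std::uint32_t old = word.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        desired = (old & ~mask) | (std::uint32_t(value) << shift);
    } while (!word.compare_exchange_weak(old, desired, std::memory_order_relaxed));
}

If a neighbouring byte is written by another thread between the load and the store, the compare_exchange fails and the loop retries, so neighbouring bytes are never silently overwritten.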
IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. (On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs.) The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (e.g. if you're adding a zero-extended byte to a 32-bit integer), which costs front-end uop throughput and code-size. Or, if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive than 32-bit stores.
As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.
That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.
PS: More about Alpha:
Alpha is interesting for a lot of reasons: it is one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA, and one of the more recent clean-slate ISAs (Itanium being another, from several years later, which attempted some neat CPU-architecture ideas).
From the Linux Alpha HOWTO.
When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:
Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.
Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.
Not only are x86 CPUs capable of reading and writing a single byte, all modern general purpose CPUs are capable of it. More importantly most modern CPUs (including x86, ARM, MIPS, PowerPC, and SPARC) are capable of atomically reading and writing single bytes.
I'm not sure what Stroustrup was referring to. There used to be a few word addressable machines that weren't capable of 8-bit byte addressing, like the Cray, and as Peter Cordes mentioned early Alpha CPUs didn't support byte loads and stores, but today the only CPUs incapable of byte loads and stores are certain DSPs used in niche applications. Even if we assume he means most modern CPUs don't have atomic byte load and stores this isn't true of most CPUs.
However, simple atomic loads and stores aren't of much use in multithreaded programming. You also typically need ordering guarantees and a way to make read-modify-write operations atomic. Another consideration is that while a CPU may have byte load and store instructions, the compiler isn't required to use them. A compiler, for example, could still generate the code Stroustrup describes, loading both b and c using a single word load instruction as an optimization.
So while you do need a well defined memory model, if only so the compiler is forced to generate the code you expect, the problem isn't that modern CPUs aren't capable of loading or storing anything smaller than a word.
The author seems to be concerned about thread 1 and thread 2 getting into a situation where the read-modify-writes interleave. (Not in software: the software issues two separate byte-sized store instructions, but somewhere down the line the logic has to do a read-modify-write.) Instead of the ideal read-modify-write, read-modify-write, you can get read-read-modify-modify-write-write, or read-modify-read-modify-write-write, or read-modify-read-write-modify-write, or some other timing in which both sides read the pre-modified version and the last one to write wins.
The concern is that you start with 0x1122; one thread wants to make it 0x33XX and the other wants to make it 0xXX44, but with, for example, a read-read-modify-modify-write-write you end up with 0x1144 or 0x3322, not 0x3344.
A sane (system/logic) design just doesn't have that problem, certainly not for a general-purpose processor like this. I have worked on designs with timing issues like this, but that is not what we are talking about here: those were completely different system designs for different purposes. In a sane design the read-modify-write does not span a long enough distance, and x86s are sane designs.
The read-modify-write would happen very near the first SRAM involved (ideally L1, when running an x86 in a typical fashion with an operating system capable of running C++-compiled multi-threaded programs) and would complete within a few clock cycles, since that RAM is ideally running at the speed of the bus. And as Peter pointed out, it is the whole cache line that experiences this, within the cache, not a read-modify-write between the processor core and the cache.
The notion of "at the same time", even with multi-core systems, isn't necessarily at the same time. Eventually things get serialized, because performance isn't based on the operations being parallel from beginning to end; it is based on keeping the busses loaded.
The quote is talking about variables allocated to the same word in memory, so that is the same program. Two separate programs are not going to share an address space like that.
You are welcome to try this: make a multithreaded program in which one thread writes to, say, address 0xnnnn00000 and the other writes to address 0xnnnn00001. Each does a write, then a read (or better, several writes of the same value then one read), checks that the read returned the byte it wrote, then repeats with a different value. Let that run for a while: hours/days/weeks/months. See if you trip up the system. Use assembly for the actual write instructions to make sure it is doing what you asked (rather than C++ or any compiler, which may or may not put these items in the same word). You can add delays to allow for more cache evictions, but that reduces your odds of "at the same time" collisions.
For your example, so long as you ensure you are not sitting on two sides of a boundary (cache or other) like 0xNNNNFFFFF and 0xNNNN00000, isolate the two byte writes to addresses like 0xNNNN00000 and 0xNNNN00001, put the instructions back to back, and see if you get a read-read-modify-modify-write-write. Wrap a test around it so that the two values are different each loop; read back the word as a whole at whatever delay you desire and check the two values. Repeat for days/weeks/months/years to see if it fails. Read up on your processor's execution and microcode features to see what it does with this instruction sequence, and as needed create a different instruction sequence that tries to get the transactions initiated within a handful or so of clock cycles on the far side of the processor core.
EDIT
The problem with the quotes is that this is all about language and how it is used. "Like most modern hardware" puts the whole topic/text in a touchy position: it is too vague. One side can argue that all they have to do is find one case that is true to make all the rest true; likewise, the other side could argue that finding one counterexample makes the rest untrue. Using the word "like" kind of messes with that, as a possible get-out-of-jail-free card.
The reality is that a significant percentage of our data is stored in DRAM in 8-bit-wide memories; we just don't normally access them as 8 bits wide, we access 8 of them at a time, 64 bits wide. In some number of weeks/months/years/decades this statement will be incorrect.
The larger quote says "at the same time" and then says read ... first, write ... last. Well, "first" and "last" and "at the same time" don't make sense together; is it parallel or serial? The context as a whole is concerned with the read-read-modify-modify-write-write variations above, where one side writes last and, depending on when that side did its read, both modifications survive or not. It is not about "at the same time", which "like most modern hardware" doesn't make sense for: things that start off actually parallel in separate cores/modules eventually get serialized if they are aiming at the same flip-flop/transistor in a memory; one eventually has to wait for the other to go first. Being physics based, I don't see this being incorrect in the coming weeks/months/years.
This is correct. An x86_64 CPU, just like an original x86 CPU, is not able to read or write anything smaller than an (in this case 64-bit) word from or to memory. And it will not typically read or write less than a whole cache line, though there are ways to bypass the cache, especially in writing (see below).
In this context, though, Stroustrup refers to potential data races (lack of atomicity on an observable level). This correctness issue is irrelevant on x86_64, because of the cache coherency protocol, which you mentioned. In other words, yes, the CPU is limited to whole word transfers, but this is transparently handled, and you as a programmer generally do not have to worry about it. In fact, the C++ language, starting from C++11, guarantees that concurrent operations on distinct memory locations have well-defined behavior, i.e. the one you'd expect. Even if the hardware did not guarantee this, the implementation would have to find a way by generating possibly more complex code.
That said, it can still be a good idea to keep the fact that whole words or even cache lines are always involved at the machine level in the back of your head, for two reasons.
First, and this is only relevant for people who write device drivers, or design devices, memory-mapped I/O may be sensitive to the way it is accessed. As an example, think of a device that exposes a 64-bit write-only command register in the physical address space. It may then be necessary to:
Disable caching. It is not valid to read a cache line, change a single word, and write back the cache line. Also, even if it were valid, there would still be a great risk that commands might be lost because the CPU cache is not written back soon enough. At the very least, the page needs to be configured as "write-through", which means writes take immediate effect. Therefore, an x86_64 page table entry contains flags that control the CPU's caching behavior for this page.
Ensure that the whole word is always written, on the assembly level. E.g. consider a case where you write the value 1 into the register, followed by a 2. A compiler, especially when optimizing for space, might decide to overwrite only the least significant byte because the others are already supposed to be zero (that is, for ordinary RAM), or it might instead remove the first write because this value appears to be immediately overwritten anyway. However, neither is supposed to happen here. In C/C++, the volatile keyword is vital to prevent such unsuitable optimizations.
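As a small sketch of that second point (the 64-bit command register is hypothetical; the address would come from the device's documentation and be mapped uncached):

#include <cstdint>

// 'cmd_reg' points into an uncached MMIO mapping of the device's command register.
void issue_commands(volatile std::uint64_t *cmd_reg)
{
    *cmd_reg = 1;   // volatile: the compiler must emit this full-width store
    *cmd_reg = 2;   // ...and must not merge it with, or reorder it past, the first one
}

Without the volatile qualifier, the compiler would be free to drop the first store or narrow the second, which is exactly the failure mode described above.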
Second, and this is relevant for almost any developer writing multi-threaded programs, the cache coherency protocol, while neatly averting disaster, can have a huge performance cost if it is "abused".
Here's a – somewhat contrived – example of a very bad data structure. Assume you have 16 threads parsing some text from a file. Each thread has an id from 0 to 15.
// shared state
char c[16];
FILE *file[16];

void threadFunc(int id)
{
    while ((c[id] = getc(file[id])) != EOF)
    {
        // ...
    }
}
This is safe because each thread operates on a different memory location. However, these memory locations would typically reside on the same cache line, or at most are split over two cache lines. The cache coherency protocol is then used to properly synchronize the accesses to c[id]. And herein lies the problem, because this forces every other thread to wait until the cache line becomes exclusively available before doing anything with c[id], unless it is already running on the core that "owns" the cache line. Assuming several, e.g. 16, cores, cache coherency will typically transfer the cache line from one core to another all the time. For obvious reasons, this effect is known as "cache line ping-pong". It creates a horrible performance bottleneck. It is the result of a very bad case of false sharing, i.e. threads sharing a physical cache line without actually accessing the same logical memory locations.
In contrast to this, especially if one took the extra step of ensuring that the file array resides on its own cache line, using it would be completely harmless (on x86_64) from a performance perspective because the pointers are only read from, most of the time. In this case, multiple cores can "share" the cache line as read-only. Only when a core tries to write to the cache line does it have to tell the other cores that it is going to "seize" the cache line for exclusive access.
(This is greatly simplified, as there are different levels of CPU caches, and several cores might share the same L2 or L3 cache, but it should give you a basic idea of the problem.)
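One common fix for the false sharing in the toy example above (a sketch, not the only way): pad each thread's slot out to its own cache line so the line never ping-pongs. 64 bytes is the typical x86 line size; C++17 also offers std::hardware_destructive_interference_size for this:

#include <cstdio>

struct alignas(64) PaddedChar {  // one cache line per element
    char c;
};

// shared state, now without false sharing
PaddedChar c[16];
FILE *file[16];

void threadFunc(int id)
{
    int ch;
    while ((ch = getc(file[id])) != EOF)
    {
        c[id].c = static_cast<char>(ch);
        // ...
    }
}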
Not sure what Stroustrup meant by "WORD".
Maybe it is the minimum size of memory storage of the machine?
Anyway not all machines were created with 8bit (BYTE) resolution.
In fact I recommend this awesome article by Eric S. Raymond describing some of the history of computers:
http://www.catb.org/esr/faqs/things-every-hacker-once-knew/
"... It used also to be generally known that 36-bit architectures
explained some unfortunate features of the C language. The original
Unix machine, the PDP-7, featured 18-bit words corresponding to
half-words on larger 36-bit computers. These were more naturally
represented as six octal (3-bit) digits."
Stroustrup is not saying that no machine can perform loads and stores smaller than its native word size; he is saying that some machines couldn't.
While this seems surprising at first, it's nothing esoteric.
For starters, we will ignore the cache hierarchy; we will take it into account later.
Assume there are no caches between the CPU and the memory.
The big problem with memory is density: trying to put as many bits as possible into the smallest area.
In order to achieve that, it is convenient, from an electrical design point of view, to expose a bus that is as wide as possible (this favours the reuse of some electrical signals, though I haven't looked at the specific details).
So, in architectures where big memories are needed (like x86) or where a simple low-cost design is favourable (for example, where RISC machines are involved), the memory bus is wider than the smallest addressable unit (typically the byte).
Depending on the budget and legacy of the project, the memory can expose a wider bus alone or along with some sideband signals to select a particular unit within it.
What does this mean practically?
If you take a look at the datasheet of a DDR3 DIMM you'll see that there are 64 DQ0–DQ63 pins to read/write the data.
This is the data bus, 64-bit wide, 8 bytes at a time.
This 8-byte granularity is deeply rooted in the x86 architecture, to the point that Intel refers to it in the WC section of its optimisation manual, where it says that data are transferred from the 64-byte fill buffer (remember: we are ignoring the caches for now, but this is similar to how a cache line gets written back) in bursts of 8 bytes (hopefully, continuously).
Does this mean that the x86 can only write QWORDS (64-bit)?
No, the same datasheet shows that each DIMM has the DM0–DM7, DQ0–DQ7 and DQS0–DQS7 signals to mask, direct and strobe each of the 8 bytes in the 64-bit data bus.
So x86 can read and write bytes natively and atomically.
However, now it's easy to see that this could not be the case for every architecture.
For instance, the VGA video memory was DWORD (32-bit) addressable and making it fit in the byte addressable world of the 8086 led to the messy bit-planes.
In general, special-purpose architectures, like DSPs, may not have byte-addressable memory at the hardware level.
There is a twist: we have just talked about the memory data bus, which is the lowest layer possible.
Some CPUs can have instructions that build a byte addressable memory on top of a word addressable memory.
What does that mean?
It's easy to load a smaller part of a word: just discard the rest of the bytes!
Unfortunately, I can't recall the name of the architecture (if it even existed at all!) where the processor simulated a load of an unaligned byte by reading the aligned word containing it and rotating the result before saving it in a register.
With stores, the matter is more complex: if we can't simply write the part of the word that we just updated we need to write the unchanged remaining part too.
The CPU, or the programmer, must read the old content, update it and write it back.
This is a Read-Modify-Write operation and it is a core concept when discussing atomicity.
Consider:
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
foo[0] = 1; foo[1] = 2;
Is there a data race?
This is safe on x86 because they can write bytes, but what if the architecture cannot?
Both threads would have to read the whole foo array, modify it and write it back.
In pseudo-C this would be
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
/* What a CPU would do (IS) What a CPU would do (IS) */
int tmp0 = *((int*)foo) int tmp1 = *((int*)foo)
/* Assume little endian Assume little endian */
tmp0 = (tmp0 & ~0xff) | 1; tmp1 = (tmp1 & ~0xff00) | 0x200;
/* Store it back Store it back */
*((int*)foo) = tmp0; *((int*)foo) = tmp1;
We can now see what Stroustrup was talking about: the two stores *((int*)foo) = tmpX obstruct each other. To see this, consider this possible execution sequence:
int tmp0 = *((int*)foo)                  /* T0 */
tmp0 = (tmp0 & ~0xff) | 1;               /* T0 */
int tmp1 = *((int*)foo)                  /* T1 */
tmp1 = (tmp1 & ~0xff00) | 0x200;         /* T1 */
*((int*)foo) = tmp1;                     /* T1 */
*((int*)foo) = tmp0;                     /* T0, Whooopsy */
If C++ didn't have a memory model, these kinds of nuisances would be implementation-specific details, leaving C++ a useless programming language in a multithreaded environment.
Considering how common the situation depicted in the toy example is, Stroustrup stressed the importance of a well-defined memory model.
Formalizing a memory model is hard work; it's an exhausting, error-prone and abstract process, so I also see a bit of pride in Stroustrup's words.
I have not brushed up on the C++ memory model but updating different array elements is fine.
That's a very strong guarantee.
We have left out the caches but that doesn't really change anything, at least for the x86 case.
The x86 writes to memory through the caches, the caches are evicted in lines of 64 bytes.
Internally each core can update a line at any position atomically unless a load/store crosses a line boundary (e.g. by writing near the end of it).
This can be avoided by naturally aligning data (can you prove that?).
In a multi-core/socket environment, the cache coherency protocol ensures that only one CPU at a time is allowed to freely write to a cached line of memory (the CPU that has it in the Exclusive or Modified state).
Basically, the MESI family of protocols uses a concept similar to the locking found in DBMSs.
This has the effect, for writing purposes, of "assigning" different memory regions to different CPUs.
So it doesn't really affect the discussion above.

sendmsg + raw ethernet + several frames

I use Linux 3.x and a modern glibc (2.19).
I would like to send several Ethernet frames without switching back and forth between kernel and user space.
I have MTU = 1500, and I want to send 800 KB.
I init receiver address like this:
struct sockaddr_ll socket_address;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...
After that I can call sendto/sendmsg 800KB / 1500 ~= 500 times and everything works fine, but this requires user space <-> kernel negotiation ~ 500 * 25 times per second. I want to avoid that.
I tried to init struct msghdr::msg_iov with the appropriate info, but I get the error "message too long"; it looks like msghdr::msg_iov cannot describe something with size > MTU.
So the question is: is it possible to send many raw Ethernet frames on Linux from userspace at once?
PS
I get the data (800KB) from a file and read it into memory, so struct iovec is good for me: I can create a suitable number of Ethernet headers and use two iovecs per 1500-byte packet, one pointing to the data and one pointing to the Ethernet header.
Whoa.
My last company made realtime hidef video encoding hardware. In the lab, we had to blast 200MB / second across a bonded link, so I have some experience with this. What follows is based upon that.
Before you can tune, you must measure. You don't want to do multiple syscalls, but can you prove with timing measurement that the overhead is significant?
I use a wrapper routine around clock_gettime that gives back time of day with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".
So, for any given syscall, you need a measurement:
tstart = nanotime();
syscall();
tdif = nanotime() - tstart;
For send/sendto/sendmsg/write, do this with small data so you're sure you're not blocking [or use O_NONBLOCK, if applicable]. This gives you the syscall overhead.
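For concreteness, a sketch of such a nanotime() wrapper (CLOCK_MONOTONIC is my choice here; the original could just as well use CLOCK_REALTIME):

#include <cstdint>
#include <time.h>     // clock_gettime (POSIX)

static inline std::uint64_t nanotime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
}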
Why are you going directly to ethernet frames? TCP [or UDP] is usually fast enough and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires ethernet frames, or was it that you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800KB/s (~1MB/s) and my project was doing 100x-200x more than that over TCP.
What about using two plain write calls to the socket? One for header, one for data [all 800KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.
Further, why do you need your header to be in a separate buffer? When you allocate your buffer, just do:
size_t datamax = 800 * 1024;            // or whatever
size_t buflen = sizeof(struct header) + datamax;
char *buf = malloc(buflen);

while (1) {
    ssize_t datalen = read(fdfile, &buf[sizeof(struct header)], datamax);
    if (datalen <= 0)
        break;                          // EOF or error
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}
This works even for the ethernet frame case.
One of the things you can also do is use setsockopt to increase the size of the kernel buffer for your socket. Otherwise, you can send data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
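A sketch of that buffer-size tweak (4 MB is an arbitrary example value; the kernel caps it at net.core.wmem_max / net.core.rmem_max and typically doubles what you ask for, so verify with getsockopt):

#include <sys/socket.h>

void grow_socket_buffers(int fdsock)
{
    int sz = 4 * 1024 * 1024;
    setsockopt(fdsock, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));  // on the sender
    setsockopt(fdsock, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));  // on the receiver
}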
To measure the performance of the wire, add some fields to your header:
u64 send_departure_time; // set by sender from nanotime
u64 recv_arrival_time; // set by receiver when the packet arrives
So, the sender sets the departure time and does the write [just do the header for this test]. Call this packet Xs. The receiver stamps this when it arrives. The receiver immediately sends back a message to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender gets this, it stamps it with an arrival time.
With the above we now have:
T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender
Assuming you do this on a relatively quiet connection with little to no other traffic and you know the link speed (e.g. 1 Gb/s), with T1/T2/T3/T4 you can calculate the overhead.
You can repeat the measurement for TCP/UDP vs ETH. You may find that it doesn't buy you as much as you think. Once again, can you prove it with precise measurement?
I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100Gb Ethernet NIC card and the NIC does the timestamping in hardware.
One of the other things you may have to do is add some throttle control. This is similar to what bittorrent does or what the PCIe bus does.
When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind write". That is, the sender is free to blast up to this much, without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. sender can add this value back to the blind write limit and keep going.
For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.
UPDATE
This update is based upon some of the additional information from your comments [the actual system configuration should go, in a more complete form, as an edit to the bottom of your question].
You do have a need for a raw socket and sending an ethernet frame. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl. You may not need to set MTU to 800KB. Your CPU's NIC card has a practical limit. You can probably increase MTU from 1500 to 15000 easily enough. This would reduce syscall overhead by 10x and that may be "good enough".
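For reference, a sketch of the MTU change via SIOCSIFMTU ("eth0" and 9000 below are example values; you need CAP_NET_ADMIN, and fd can be any socket):

#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>

int set_mtu(int fd, const char *ifname, int mtu)
{
    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    return ioctl(fd, SIOCSIFMTU, &ifr);   // -1 on error, errno set
}

// e.g. set_mtu(fdsock, "eth0", 9000);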
You probably will have to use sendto/sendmsg. The two write calls were based on conversion to TCP/UDP. But, I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most example code for what you want uses sendto. sendmsg seems like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
In addition to improving syscall overhead, a larger MTU might improve the efficiency of the "wire", even though this doesn't seem like a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I am still puzzled by one of your comments about "not using a wire". FPGA connected to ethernet pins of the CPU I get--sort of. More precisely, do you mean FPGA pins connected to the ethernet pins of the CPU's NIC card/chip?
Are the CPU/NIC on the same PC board and the FPGA pins are connected via PC board traces? Otherwise, I don't understand "not using a wire".
However, once again, I must say that you must be able to measure your performance before you blindly try to improve it.
Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize for it may not be worth it and doing so may actually hurt performance more severely in other areas that you didn't realize when you started.
As an example, I once worked on a system that had a severe performance problem, such that, the system didn't work. I suspected the serial port driver was slow, so I recoded from a high level language (e.g. like C) into assembler.
I increased the driver performance by 2x, but it contributed less than a 5% performance improvement to the system. It turned out the real problem was that other code was accessing non-existent memory which just caused a bus timeout, slowing the system down measurably [it did not generate an interrupt that would have made it easy to find as on modern systems].
That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!
Nowadays, I never try large optimization until I can measure first. In some cases, I add an optimization that I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out that the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.
What CPU are you using: x86, arm, mips, etc. At what clock frequency? How much DRAM? How many cores?
What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted entirely to logic, or do you also have a CPU inside it such as MicroBlaze, Nios, or ARM? Does the FPGA have access to DRAM of its own [and how much DRAM]?
If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint? If you increase the MTU, you may need to add an ack/sync protocol as I suggested in the original post.
Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between CPU and FPGA.
This may be mitigated, purely as a side effect of sending small packets. If you increase MTU too much, you might overwhelm the FPGA. In other words, it is the very overhead you're trying to optimize away, that allows the FPGA to keep up with the data rate.
This is what I meant by unintended consequences of blind optimization. It can have unintended and worse side effects.
What is the nature of the data being sent to the FPGA? You're sending 800KB, but how often?
I am assuming that it is not the FPGA firmware itself for a few reasons. You said the firmware was already almost full [and it is receiving the ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. So, am I correct?
You're sending the data to the FPGA from a file. This implies that it is only being sent once, at the startup of your CPU's application. Is that correct? If so, optimization is not needed because it's an init/startup cost that has little impact on the running system.
So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall. Not just from syscall overhead, but optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy/transfer is 64KB, depending upon the filesystem or underlying disk characteristics.
So, if you're looking to reduce overhead, reading data from a file may have considerably more of it than having the application generate the data [if that's possible].
The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.
You say your system is using a lot of CPU time for other things. Can you measure those other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [can quite probably find] the larger bottlenecks and recode those, and you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.
Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.
When considering performance, it is important to consider it from an overall system standpoint. In your case, this means CPU, FPGA, and anything else in it.
You say that the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they're not because the FPGA is almost out of space, otherwise you would? Is the FPGA firmware 100% done? Or, is there more RTL to be written? If you're at 90% space utilization in the FPGA, and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, possibly with a higher clock rate.
In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We also used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendors development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.
So, when an FPGA is almost full with logic, the place-and-route can be difficult to achieve. It might be a reason to consider a larger part [if possible]

Recommended CRC16 polynomials for data logging application

I'm writing a data logging application (running on a microcontroller) which will write data to ordinary, embedded NOR-type serial flash memory (in this case - an AT25DF161.)
Each packet of data (240 or 496 bytes) will be logged individually to the flash, one after another. I figure the most common failure in the flash memory would be a stuck bit - typically "0", the non-erased state. I need to be able to detect single-bit events, at most two per record (I assume this as a worst case after 100,000 write cycles).
I'm using a processor which has a built-in 16-bit CRC calculation module, so there's no performance implication for using fewer or more terms - so what decisions would I need to make to decide on an optimum polynomial?
Use a standard polynomial. You can find a list to choose from here.
Look at the paper by Philip Koopman, "Cyclic Redundancy Code (CRC) Polynomial Selection for Embedded Networks". He analyzes a number of 16-bit polynomials, measuring their error detection capability for different message lengths. As you'll see from his paper, they are not all created equal. For a small number of bit errors (at most two per record in your case, so you want a polynomial that still gives HD >= 3 at your block length) and a fairly large block, 0xBAAD might be a good choice.
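If you ever want a software reference to cross-check the hardware module, a plain bit-at-a-time version could look like this. Note that Koopman lists polynomials in his "implicit +1" notation, so his 0xBAAD corresponds to the full polynomial 0x1755B, i.e. 0x755B in the conventional MSB-first notation where the x^16 term is implicit; the 0xFFFF initial value is an arbitrary choice you would have to use consistently on both sides:

#include <cstddef>
#include <cstdint>

std::uint16_t crc16_baad(const std::uint8_t *data, std::size_t len)
{
    const std::uint16_t poly = 0x755B;      // Koopman 0xBAAD in normal MSB-first form
    std::uint16_t crc = 0xFFFF;             // initial value (pick one and stick to it)
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint16_t>(data[i]) << 8;
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x8000) ? static_cast<std::uint16_t>((crc << 1) ^ poly)
                                 : static_cast<std::uint16_t>(crc << 1);
    }
    return crc;
}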

How to use DSP to speed-up a code on OMAP?

I'm working on a video codec for the OMAP3430. I already have code written in C++, and I'm trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, has an additional DSP).
I tried to port a small for loop which runs over a very small amount of data (~250 bytes), but about 2M times on different data. But the overhead from the communication between the CPU and the DSP is much more than the gain (if there is any).
I assume this task is much like optimizing code for the GPU in normal computers. My question is: what kinds of parts would be beneficial to port? How do GPU programmers take care of such tasks?
edit:
GPP application allocates a buffer of size 0x1000 bytes.
GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned along a 4K page boundary.
GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space which has been mapped to a buffer allocated on the GPP. GPP application uses DSPNode_PutMessage to send the message to the DSP.
GPP invokes memcpy to copy the data to be processed into the shared memory.
GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access the buffer. The message also contains the amount of data written to the buffer so that the DSP will know just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait to hear a message back from the DSP.
After these steps the execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes the processing. Just to try it out, I don't put any processing inside the DSP program; I just send a "processing finished" message back to the GPP. And this still consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely because of the communication overhead?
The OMAP3430 does not have an on-board DSP; it has an IVA2+ Video/Audio decode engine hooked to the system bus, and the Cortex core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX based unit. While it does have programmable shaders, I don't believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I've never heard of such support.
If you're using the IVA2+ encode/decode engine that is on board, you need to use the proper libraries for this unit, and it only supports specific codecs from what I know. Are you trying to write your own library against this module?
If you're using the Cortex's built-in DSP-ish features (SIMD instructions), post some code.
If your dev board has some extra DSP on it, what is the DSP and how is it connected to the OMAP?
As to the desktop GPU question, in the case of video decode you use the vendor-supplied function libraries to make calls to the hardware. There are several: VDPAU for Nvidia on Linux, and similar libraries on Windows (PureVideo HD, I think it's called). ATI also has both Linux and Windows libraries for their on-board decode engines; I don't know the names.
I don't know what time base you're transferring data in, but I know the TMS320C64x, which is listed on the spec sheet for the SDK, has a very powerful DMA engine. (I'm assuming it's the original ZOOM OMAP34X MDK. It says it has a 64xx.) I would hope the OMAP has something similar; use them to their fullest advantage. I would recommend setting up "ping-pong" buffers in the internal RAM of the 64xx and using the SDRAM as shared memory, with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts will have the ability to bus 8 32-bit words to the processor core once data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for the parts are available from TI for free download in PDF. They even gave me free hardcopies of the TMS320C6000 CPU and Instruction Set manual and many other books.
As far as programming is concerned, you may need to use some of the "processor intrinsics" or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations when possible, because it doesn't have a built-in floating-point core. (Those are in the 67xx series.) If you look at the execution units and can map your calculations such that the different parts target different operations in a manner which can occur in a single cycle, then you will be able to achieve the best performance out of those parts. The instruction set manual lists the types of ops that are performed by each execution unit. If you can break your calculation up into dual data-flow sets and unwind the loops a bit, the compiler will be "nicer" to you when full optimization is on. This is due to the fact that the processor is broken up into a left and a right side, with nearly identical execution units on either side.
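As a generic illustration of the "dual data flows plus a bit of unrolling" idea (plain C++ rather than TI intrinsics; the function and its workload are just an example): two independent accumulators give the compiler two chains it can schedule onto the two halves of the datapath:

#include <cstddef>
#include <cstdint>

std::int32_t sum_abs_diff(const std::uint8_t *a, const std::uint8_t *b, std::size_t n)
{
    std::int32_t acc0 = 0, acc1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {                 // unrolled by two
        acc0 += a[i]     > b[i]     ? a[i]     - b[i]     : b[i]     - a[i];
        acc1 += a[i + 1] > b[i + 1] ? a[i + 1] - b[i + 1] : b[i + 1] - a[i + 1];
    }
    if (i < n)                                   // odd tail element
        acc0 += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return acc0 + acc1;
}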
Hope this helps.
From the measurements I did, one messaging cycle between CPU and DSP takes about 160us. I don't know whether this is because of the kernel I use, or the bridge driver; but this is a very long time for a simple back & forth messaging.
It seems that it is only reasonable to port an algorithm to DSP if the total computational load is comparable to the time required for messaging; and if the algorithm is suitable for simultaneous computing on CPU and DSP.