How to use the DSP to speed up code on OMAP? - c++

I'm working on a video codec for the OMAP3430. I already have code written in C++, and I'm trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, comes with an additional DSP).
I tried to port a small for loop that runs over a very small amount of data (~250 bytes), but about 2 million times on different data. However, the overhead of the communication between the CPU and the DSP is much greater than the gain (if there is any gain at all).
I assume this task is much like optimizing code for GPUs in desktop computers. My question is: which kinds of parts are worth porting? How do GPU programmers deal with such tasks?
edit:
1 - The GPP application allocates a buffer of size 0x1000 bytes.
2 - The GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer, using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned to a 4K page boundary (see the sizing sketch below).
3 - The GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
4 - The GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space that has been mapped to a buffer allocated on the GPP. The GPP application uses DSPNode_PutMessage to send the message to the DSP.
5 - The GPP invokes memcpy to copy the data to be processed into the shared memory.
6 - The GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
7 - The GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access it. The message also contains the amount of data written to the buffer so that the DSP knows just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait for a message back from the DSP.
After these steps the execution of the DSP program starts, and the DSP notifies the GPP with a message when it finishes processing. Just as a test, I don't do any processing inside the DSP program; I just send a "processing finished" message back to the GPP. And this still consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely because of the communication overhead?
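As a concrete illustration of the reservation sizing in step 2 (a minimal sketch in plain C++; the function and variable names are mine, not from the DSP Bridge headers):

    #include <cstddef>

    // Reserve the requested buffer size plus one extra 4K page of slack for
    // automatic page alignment, then round the total up to a 4K boundary.
    std::size_t reservation_size(std::size_t buffer_size)
    {
        const std::size_t page = 0x1000;              // 4K page
        std::size_t total = buffer_size + page;       // + one page of slack
        return (total + page - 1) & ~(page - 1);      // align total to 4K
    }

For the 0x1000-byte buffer in step 1 this yields a 0x2000-byte reservation.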

The OMAP3430 does not have an on-board DSP; it has an IVA2+ Video/Audio decode engine hooked to the system bus, and the Cortex core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX-based unit. While it does have programmable shaders, I don't believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I've never heard of such support.
If you're using the IVA2+ encode/decode engine that is on board, you need to use the proper libraries for this unit, and as far as I know it only supports specific codecs. Are you trying to write your own library for this module?
If you're using the Cortex's built-in DSP-ish SIMD instructions, post some code.
If your dev board has some extra DSP on it, what is the DSP and how is it connected to the OMAP?
As to the desktop GPU question: in the case of video decode you use the vendor-supplied function libraries to make calls to the hardware. There are several: VDPAU for Nvidia on Linux, and similar libraries on Windows (PureVideo HD, I think it's called). ATI also has both Linux and Windows libraries for their on-board decode engines; I don't know the names.

I don't know what time frame you're transferring data in, but I know the TMS320C64x, which is listed on the spec sheet for the SDK, has a very powerful DMA engine. (I'm assuming it's the original ZOOM OMAP34X MDK. It says it has a 64xx.) I would hope the OMAP has something similar; use it to its fullest advantage. I would recommend setting up "ping-pong" buffers in the internal RAM of the 64xx and using the SDRAM as shared memory, with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx-series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts have the ability to bus eight 32-bit words to the processor core once the data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for the parts are available from TI as free PDF downloads. They even gave me hard copies of the TMS320C6000 CPU and Instruction Set manual and many other books for free.
As far as programming is concerned, you may need to use some of the "processor intrinsics" or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations when possible, because it doesn't have a built-in floating-point core. (Those are in the 67xx series.) If you look at the execution units and you can map your calculations such that the different parts target different operations in a manner that can occur in a single cycle, then you will be able to achieve the best performance out of those parts. The instruction set manual lists the types of operations performed by each execution unit. If you can break your calculation up into dual data-flow sets and unwind the loops a bit, the compiler will be "nicer" to you when full optimization is on. This is because the processor is broken up into a left and a right side with nearly identical execution units on either side.
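To make the dual data-flow idea concrete, here is a minimal sketch (plain C++; the function name and the even/odd split are illustrative, not from the original code). Splitting one serial accumulation into two independent chains gives the compiler a better chance of scheduling the work onto the two near-identical sides of the core and software-pipelining the loop:

    // Two independent accumulators instead of one serial dependency chain.
    int sum_split(const short *a, int n)   // n assumed to be even
    {
        int s0 = 0, s1 = 0;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];       // one data flow
            s1 += a[i + 1];   // an independent data flow
        }
        return s0 + s1;
    }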
Hope this helps.

From the measurements I did, one messaging cycle between the CPU and the DSP takes about 160 µs. I don't know whether this is because of the kernel I use or the bridge driver, but this is a very long time for a simple back-and-forth message exchange.
It seems that porting an algorithm to the DSP is only reasonable if its total computational load is large compared to the time required for messaging, and if the algorithm is suitable for simultaneous computation on the CPU and the DSP.
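For a rough sense of scale, using the numbers above: dispatching each of the ~2 million 250-byte work items individually would cost on the order of 2,000,000 × 160 µs ≈ 320 s in messaging alone, so the work has to be batched into far fewer, larger transfers before offloading can pay off.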

Related

Locking a process to a CUDA core

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core so that it essentially runs forever - reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.
All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from its own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
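A minimal host-side sketch of that zero-copy setup using the CUDA runtime API (error handling omitted; the kernel that would consume the device pointer is not shown):

    #include <cuda_runtime.h>

    int main()
    {
        // Allow device pointers to alias mapped (pinned) host memory.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        // Pinned host allocation the GPU can read/write directly,
        // i.e. "system memory" from the kernel's point of view.
        void *host_buf = nullptr;
        cudaHostAlloc(&host_buf, 1 << 20, cudaHostAllocMapped);

        // Device-side alias of the same memory, to be passed to a kernel.
        void *dev_alias = nullptr;
        cudaHostGetDevicePointer(&dev_alias, host_buf, 0);

        // ... launch a (persistent) kernel that reads/writes dev_alias ...

        cudaFreeHost(host_buf);
        return 0;
    }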
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.

What's the best way of implementing a buffer of fixed size when using fread in C++?

Suppose that you have a file of integers and you want to read them one by one.
You have two options for buffering.
Declare an array buffer of size N and use setvbuf to tell fread which buffer to use. Then, when calling fread to read an integer, you write fread(&myInt, sizeof(myInt), 1, inputFile);
Declare the same array buffer, but this time don't use setvbuf. Instead, handle the buffering yourself: call fread(buffer, bufferSize*sizeof(int), 1, inputFile)
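A minimal sketch of the two options (assuming the file is already open; names such as bufferSize are illustrative):

    #include <cstdio>

    enum { bufferSize = 4096 };

    void option1_setvbuf(std::FILE *inputFile)
    {
        static char buffer[bufferSize];   // static so it outlives any use of the stream
        // Hand stdio our buffer; must be done before the first read.
        std::setvbuf(inputFile, buffer, _IOFBF, bufferSize);
        int myInt;
        while (std::fread(&myInt, sizeof(myInt), 1, inputFile) == 1) {
            /* use myInt */
        }
    }

    void option2_manual(std::FILE *inputFile)
    {
        static int buffer[bufferSize / sizeof(int)];
        std::size_t got;
        // Pull a whole block per call and walk it ourselves; using sizeof(int)
        // as the element size means a short final read still reports how many
        // whole ints arrived.
        while ((got = std::fread(buffer, sizeof(int),
                                 bufferSize / sizeof(int), inputFile)) > 0) {
            for (std::size_t i = 0; i < got; ++i) {
                /* use buffer[i] */
            }
        }
    }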
From my understanding setvbuf exists to make your life easier, but does it come at a cost? Which method would you use in terms of performance?
I would use neither of your examples. I don't think that part of the I/O is the performance bottleneck.
The vbuf is an area for the input routine to place data before putting it into your destination. It could be used as a cache or as a preformatting buffer.
Most of the time, I/O bottlenecks are related to the quantity of data fetched and the number of fetches. For example, reading one byte at a time is less efficient than reading a block of bytes.
Another I/O-related bottleneck is the duration between input requests. I/O devices prefer to keep streaming data, non-stop. Some input devices, like hard drives, have an overhead time between when the request is received and when the data starts transmitting. For hard drives, this would be the disk spin-up and seek time.
For the best performance, don't waste development time messing with the C or C++ libraries; you need hardware assistance. Some platforms have a Direct Memory Access (DMA) controller. This device can take data from an input source and deliver it to memory without using the CPU, so the CPU can be executing instructions while the DMA is transferring data. In order to use hardware assistance, you need to write code at the OS driver level, or access the OS drivers directly.
The C and C++ I/O libraries are designed for a platform independent concept called streams. There may be execution overhead associated with this (such as extra buffering). If you don't care about different platforms, then access the OS drivers directly.
Don't waste your time messing with the C and C++ libraries. Not much performance gain there. More performance lies in accessing the OS drivers directly (or using your own). How and when you access the I/O will show bigger performance gains than tweaking the C and C++ libraries.
Lastly, using the processor's data cache effectively will gain you performance too.

CRC/Parity/Hamming Protect 16-bit parallel bus

I've got a Cortex-M4 based MCU linked to an FPGA via a 16-bit parallel memory bus interface. In essence, the FPGA behaves like an external memory mapped into the memory space of the MCU: the MCU presents an address and then either writes a data word or reads the word presented by the FPGA.
I want to protect both read and write against transmission errors both during addressing and data write/read. However, I don't expect many bit errors since the distance between both parts is short.
I can easily implement checking and generation of parity, Hamming codes, or CRC inside the FPGA. However, doing the same (checking and generating) in the uC seems comparatively harder, since I don't want to cripple the throughput. Without error detection, reading and writing 16-bit words takes around 4-6 processor cycles and is thus rather fast. Consequently, I don't want to spend hundreds of cycles on protective measures.
In the end I am looking for a moderately efficient error detection method for 16-bit data that is implemented in a uC in as few cycles as possible.
It's (in my experience) quite rare to protect a parallel bus like this. It's of course done in PC and server class hardware with ECC RAM and so on, but rarely in microcontrollers.
If your particular Cortex-M4 implementation has a hardware CRC block, you might be able to stream the data there, assuming you can simply add a word of CRC to the end of each bus transfer. That would probably still slow it down by at least a factor of 2-3 though, since each word coming to/from the FPGA must also be fed in software to the CRC unit.
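As an illustration of the cheapest of the options the question names, simple word parity can be computed in a handful of Cortex-M4 instructions per transfer (a sketch; whether one parity bit per 16-bit word gives enough protection is a separate question):

    // Parity of the low 16 bits by XOR-folding; compiles to a few EORs
    // and shifts, so it adds little to the 4-6 cycles of the bus access.
    static inline unsigned parity16(unsigned w)
    {
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return w & 1u;   // 1 if the 16-bit word has an odd number of set bits
    }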

C++ Ways to transfer large amounts of data between 32-bit applications for video playback

I am aware of the basics of shared memory and inter process communication, but since my application is fairly specific I'm asking this question for general feedback.
I am working on 64-bit machines (Mac OS and Win64), using a 32-bit visual coding toolkit. It is not practical to port the toolkit to 64-bit at this time, so I have memory limitations.
I am working on an application which must be able to scrub (go back and forth based on user input) high quality video at fast speeds. The obvious solutions are:
1 - Keep it all in memory.
2 - Stream from disk.
Putting it all in memory at the moment requires lowering the video quality to an unacceptable point, and streaming from disk causes the scrub to hang while loading.
My current train of thought is to run a master and multiple slave programs. Each slave will load up a segment of the video into ram, and when the master program needs to load a different section of the video it will request this data from the slave and have it transferred over.
My question is, what is an appropriate way to do this?
I suspect shared memory will not allow me to get past the 32-bit memory limitations my application currently has. I could do something as simple as pipes, but I was wondering if there is something else that is more suitable.
Ideally this solution would be Mac/Win portable, but since the final solution must reside on a windows box I will opt for windows solutions. Also the easier the better, as I'm not looking to spend weeks in dev time on this.
Thanks in advance.
I'm going to guess you are (or at least can be) using a 64-bit machine with a 64-bit OS, even though it's impractical to port all your code to 64 bits. I'm also assuming that your machine has enough memory available to hold the data you care about -- the real problem is getting access to enough of that memory from 32-bit code.
If that's the case, then I'd look at Windows' Address Windowing Extensions (AWE) functions, such as AllocateUserPhysicalPages and MapUserPhysicalPages. These work quite a bit like file mapping except that when you map data into your address space, it's already in physical memory instead of having to be read from the disk (i.e., the mapping is much faster).
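A heavily condensed sketch of that approach (it assumes the process already holds the "Lock pages in memory" privilege, and error handling is reduced to early returns):

    #include <windows.h>
    #include <vector>

    int main()
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        const SIZE_T windowBytes = 256 * 1024 * 1024;        // 256 MB view
        ULONG_PTR pageCount = windowBytes / si.dwPageSize;
        std::vector<ULONG_PTR> frames(pageCount);

        // Grab physical pages (fails if the privilege is missing).
        if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, frames.data()))
            return 1;

        // Reserve a window in the 32-bit address space for the pages.
        void *window = VirtualAlloc(nullptr, windowBytes,
                                    MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
        if (!window)
            return 1;

        // Map the pages into the window; remapping later with a different
        // frame array is how you "scrub" to another part of the video.
        if (!MapUserPhysicalPages(window, pageCount, frames.data()))
            return 1;

        // ... fill/consume the window ...

        MapUserPhysicalPages(window, pageCount, nullptr);     // unmap
        FreeUserPhysicalPages(GetCurrentProcess(), &pageCount, frames.data());
        VirtualFree(window, 0, MEM_RELEASE);
        return 0;
    }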
I would embed or install, depending on your requirements for distribution, one or more instances of Memcached and have one (or more if necessary) thread feed blocks from disk into the memcache.
Once you've moved your data into memcached, you are pretty much immune to the 32-bit limitations, especially if memcached itself runs as a 64-bit process.
Basically, your program would read from a socket instead of a file, and memcached would act as a fancy file cache.
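A minimal sketch of the client side using libmemcached's C API (the key naming scheme is illustrative; note that large video segments would have to be split to fit memcached's per-item size limit):

    #include <libmemcached/memcached.h>
    #include <cstring>
    #include <cstdlib>

    int main()
    {
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "localhost", 11211);   // local memcached instance

        // The feeder thread/process stores a block of video under a key.
        const char *key = "segment:0042";
        const char *block = "...video data...";
        memcached_set(memc, key, std::strlen(key), block, std::strlen(block), 0, 0);

        // The playback side later fetches the block by key.
        size_t len = 0;
        uint32_t flags = 0;
        memcached_return_t rc;
        char *value = memcached_get(memc, key, std::strlen(key), &len, &flags, &rc);
        if (value) {
            /* hand the block to the scrubbing/playback code */
            std::free(value);
        }

        memcached_free(memc);
        return 0;
    }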

Creating a native application for X86?

Is there a way I could make a C or C++ program that would run without an operating system and that would draw something like a red pixel to the top left corner? I have always wondered how these types of applications are made. Since Windows is written in C I imagine there is a way to do this.
Thanks
If you're writing for a bare processor, with no library support at all, you'll have to get all the hardware manuals, figure out how to access your video memory, and perform whatever operations that hardware requires to get a pixel drawn onto the display (or a sound on the beeper, or a block of memory read from the disk, or whatever).
When you're using an operating system, you'll rely on device drivers to know all this for you. Programs are still written, every day, for platforms without operating systems, but rarely for a bare processor. Many small MPUs come with a support library, usually a set of routines that lets you manipulate whatever peripheral devices they support.
It can certainly be done. You typically write the code in C, and you pretty much have to do everything on your own, with no standard library. To set your pixel, you'd usually load a pointer to the physical address of the screen, and write the correct value to that pointer. Alternatively, on a PC you could consider using the VESA BIOS. In all honesty, it's fairly similar to the way most code for MS-DOS was written (most used MS-DOS to read and write data on disk, but little else).
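As a minimal sketch of that last point (assuming the machine has already been switched to the classic VGA mode 13h, 320x200 with 256 colours, and that the address is identity-mapped so the physical framebuffer is directly addressable): the framebuffer starts at physical address 0xA0000 and palette index 4 is red in the default palette, so the top-left pixel is a single store.

    // Write one red pixel at (0,0) in VGA mode 13h.
    void draw_red_pixel_top_left()
    {
        volatile unsigned char *vram =
            reinterpret_cast<volatile unsigned char *>(0xA0000);
        vram[0] = 4;   // palette index 4 = red
    }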
The core bootloader and the part of the Kernel that bootstraps the OS are written in assembly. See http://en.wikipedia.org/wiki/Booting for a brief writeup of how an operating system boots. There's no way I'm aware of to write a bootloader or Kernel purely in a higher level language such as C or C++ without using assembly.
You need to write a bootstrapper and loader combination followed by a payload, which involves setting the VGA mode manually via an interrupt, getting a pointer to the basic video buffer, and then writing a value to its 0th byte.
Start here: http://en.wikipedia.org/wiki/Bootstrapping_(computing)
Without an OS it's difficult to have a loader, which means no dynamic libc. You'd have to link statically, as well as have a decent amount of bootstrap code written in assembly (although it could be provided as object files which you could then link with). Also, since you'd be at the mercy of whatever the system has, you'd be stuck with the VESA video modes (unless you want to write your own graphics driver and subsystem, which you don't).
There is, but not generally from within the OS. Initially, there is an asm stub that's executed from the MBR on the drive. See MBR. For x86 processors, this is generally 16-bit code; it jumps into the operating system code from there and switches to 32-bit/64-bit mode depending on the operating system and chipset.