C++, OpenCV, & Kinect: Processing speed goes down

I use C++ (Visual Studio 2015) and OpenCV (ver 3.2.0) to process data sent from a Kinect v1. My C++ program has no problem the first time it starts debugging. However, after I stop debugging and start it again, it gets very slow.
I suspect that the program closes without releasing some memory (i.e., a memory leak). I am aware that I would need to use delete to release memory allocated with new, but I didn't use new anywhere in the program (nor malloc(), its equivalent in C programs).
For OpenCV, I call the destroyAllWindows function at the end of the program. For Kinect v1, I also call the NuiShutdown(), Release(), and CloseHandle() functions at the end of the program.
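For context, my shutdown sequence looks roughly like this (a minimal sketch; the names pSensor, hNextFrameEvent, and frame are stand-ins for my actual variables):

#include <windows.h>
#include <NuiApi.h>              // Kinect SDK v1
#include <opencv2/opencv.hpp>

extern INuiSensor* pSensor;      // the sensor opened at startup
extern HANDLE hNextFrameEvent;   // the frame-ready event
extern cv::Mat frame;            // the last processed frame

void shutdown()
{
    // Kinect v1: stop the sensor, release the event handle and the COM object
    pSensor->NuiShutdown();
    if (hNextFrameEvent && hNextFrameEvent != INVALID_HANDLE_VALUE)
        CloseHandle(hNextFrameEvent);
    pSensor->Release();

    // OpenCV: Mat buffers are reference-counted and free themselves on scope
    // exit, but release() drops this reference explicitly
    frame.release();
    cv::destroyAllWindows();
}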
Is there anything else I need to do to release memory (e.g., releasing the memory associated with Mat objects in OpenCV)? Or is something else causing the decrease in processing speed?
I'd appreciate your help. Thanks.

After the first run, disconnect the Kinect, then reconnect it and try a second run.
If all goes well then, the problem is most likely a stuck thread. Device access is usually handled by separate threads, and especially with USB they can get stuck (after an error or a sync problem between what the host accesses and what the device expects) until you disconnect the device. (I am not sure which Kinect driver you are using, but the JUNGO version that NuiShutdown() implies has this problem.) You can also check Task Manager before disconnecting to see whether some stuck processes are left over after the first run.
To remedy this you need to find out what you are doing wrong during access. It could be:
- Wrong USB port: use the back-side ports, not the front slots.
- Invalid USB transfer request: the device is always waiting for a specific set of commands or a stream, and it blocks everything else until it receives them. Using unsupported commands, or reading at the wrong times or with the wrong packet sizes, can cause this.
- USB communication out of sync: the PC host can time out if it does not have enough CPU power while a critical operation is being processed (or if too many apps are open in the background).
This can also be caused by a bad graphics driver, as I suspect you are rendering... Intel HD Graphics can generate such problems with ease, especially on notebooks. Try disabling any rendering in your app, or at least limit rendering to OpenGL 1.0, to see if the speed is the same between runs. If this is the cause, the whole desktop usually flickers or does not repaint parts of apps... and animations are sometimes sluggish.
Another problem might be the debugger. If all is well without it, then the debugger is the problem and you cannot solve it. Debugging while accessing I/O can cause sync and timeout problems, especially with USB.
To check for memory leaks, simply note how much free memory you have before the 1st run and compare it with the values after the 1st, 2nd, 3rd... runs. If the value keeps dropping, something is stuck somewhere. After an app closes, all the memory belonging to it is freed by the OS, so even a forgotten delete does not matter, unless some thread is still running...
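For example, on Windows you can log the free memory with GlobalMemoryStatusEx (a minimal sketch; run it before/after each run of your app and compare the numbers):

#include <windows.h>
#include <cstdio>

int main()
{
    // Query free physical memory at this moment
    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);
    printf("free physical memory: %llu MB\n", ms.ullAvailPhys / (1024 * 1024));
    return 0;
}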
Some USB drivers based on libUSB that I encountered also had a problem with handle leaks. But that behaves differently... all runs fine until there are no free handles left. After that the OS is non-functional: you cannot open any window, app, anything... until some app is closed.
[Edit1] Front USB slots
Front slots are usually connected to the motherboard by a relatively long cable (usually flat and not very well shielded), so they are more susceptible to noise. The cable is also usually routed near the HDD and above high-frequency parts of the motherboard, which induce noise into the USB feed. All this degrades the quality of the USB signal, causing a much bigger packet rejection rate and hence lowering sync capability as well as the overall usable bandwidth.
Compare that with the back-side USB ports: they have no cables but are connected directly on the PCB with short, well-shielded traces, so the connection quality is much better.
So if you use a device demanding high bandwidth or tight synchronization, the front ports are a bad choice.

Related

Inner workings of Raspberry Pi userland graphics driver (not firmware or kernel part)

I'm trying to understand the userland part of the Raspberry Pi graphics driver code from https://github.com/raspberrypi/userland
My understanding so far is:
- a firmware blob runs in the GPU and offers an OpenGL-like interface which, on lower levels, is based on message (byte-array) passing on top of one of multiple 28-bit-word FIFOs called VCHIQ (the other VCHIQ queues are irrelevant for graphics)
- on the CPU part, OpenGL calls are turned into messages to the GPU. Access to the low-level facility (either the message queue or VCHIQ -- I haven't found that part yet in the code) requires a Linux kernel module, but no high-level logic happens in there.
- the GPU part is closed, but that's okay for my purposes. The (ARM) CPU part is, AFAIK, open
My ultimate goal is to get communication with the GPU working on bare metal (without Linux), but with the closed firmware blob intact. As a first goal, I want to understand how an OpenGL call is actually passed to the GPU. Anything beyond that is not part of this question.
However, I'm stuck at finding the actual code for this. The OpenGL calls use RPC_CALL* and in turn RPC_DO, which calls khronos_server_lock_func_table(). However, that function seems to be missing from the code, and to my surprise, I couldn't find anything useful about it on Google.
My questions:
- am I still on the ARM CPU side, or did I move to GPU land without noticing? If the latter is the case, where did I cross that line?
- Assuming I'm still on the CPU side -- where is the code for that function? Is it open at all, or do we actually have closed parts left around on the CPU side here? All sources on the web seem to indicate that the code for the CPU is 100% open.
- at which point does the implementation of the C OpenGL functions actually send a message to the GPU? I'm somewhat expecting a call to the kernel functionality that represents VCHIQ to be happening at some point, probably implemented as a device file.
I don't fully understand how you intend to access the GPU without using Linux, and I am not that familiar with the technicalities, but some time ago I dug into the GPU for a private project, so I'll tell you what I know.
The GPU is VideoCore IV and its documentation is available on Broadcom's website.
Also, on the Raspberry Pi Wiki you can see in the picture on the left that VCHIQ lives in the kernel driver, so you might look for the implementation details in the kernel's source code.
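For orientation, userland reaches that kernel driver through a character device node; here is a minimal sketch of that pattern (the /dev/vchiq node is real, but the ioctl is left as a commented placeholder, since the actual request codes live in the kernel's vchiq_ioctl.h and I have not verified them):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <cstdio>

int main()
{
    // The VCHIQ kernel driver exposes a char device; userland opens it and
    // issues ioctls to connect, create services, and queue messages to the GPU.
    int fd = open("/dev/vchiq", O_RDWR);
    if (fd < 0) { perror("open /dev/vchiq"); return 1; }

    // ioctl(fd, VCHIQ_IOC_CONNECT, 0);  // placeholder; see vchiq_ioctl.h

    close(fd);
    return 0;
}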
This might be of some help too: the VideoCore IV Programmer's Manual. From the document's own description:
This is an independent documentation project based on a combination of static analysis and trial and error on real hardware. This work is 100% independent from and not sanctioned by or connected with Broadcom or its agents. No Broadcom documents or materials were used beyond those publicly available.
As for the software itself, The Khronos Group provides the OpenGL ES and OpenVG specifications, but the implementation is not open source. You can get the documentation from their website, but I doubt you'll find anything at such a low level.
Hope it helps.

Slow or delayed loading of my application

Question:
My question is: what will be the impact on my application's memory footprint or performance if I replace functions like foo1 (which I have in my code) below with foo2? This function is called frequently in the application.
#include <memory>

#define SIZE 5000

void foo1()
{
    double data[SIZE];   // 5000 doubles, ~40 KB on the stack
    // ....
}

void foo2()
{
    std::unique_ptr<double[]> data(new double[SIZE]);   // same ~40 KB, on the heap
    // ....
}
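To quantify the impact, I could time both variants with a micro-benchmark like the one below (a minimal sketch; the iteration count and the volatile sink are mine, added so the optimizer cannot drop the allocations entirely):

#include <chrono>
#include <cstdio>
#include <memory>

#define SIZE 5000
volatile double sink;   // keeps the optimizer from removing the work

void foo1() { double data[SIZE]; data[SIZE - 1] = 1.0; sink = data[SIZE - 1]; }
void foo2() { std::unique_ptr<double[]> data(new double[SIZE]); data[SIZE - 1] = 1.0; sink = data[SIZE - 1]; }

template <typename F>
long long ns_per_call(F f, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / iters;
}

int main()
{
    printf("foo1 (stack): %lld ns/call\n", ns_per_call(foo1, 100000));
    printf("foo2 (heap) : %lld ns/call\n", ns_per_call(foo2, 100000));
    return 0;
}

The stack version only moves the stack pointer, while the heap version pays for a new/delete pair on every call.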
Context:
My MFC application loads really slowly on the embedded device running Windows 7 after implementing new features/modules. The same application loads fast on a PC. At least one difference, and what I suspect is the cause, is that the RAM on the embedded unit is really low: just 768 MB.
I debugged it to find out where the delay occurs and recorded timestamps within the application during the loading process. What I discovered was interesting: when I double-click the exe, it takes about a minute to record the first timestamp, and after that it runs fast, so all the delay is right there.
My theory is that Windows is taking all this time to set up the environment for the exe, and once that's done, it runs fast. The reason I suspect this is that a lot of big structures are declared on the stack in the application, to the point that I had to move some of them to the heap to get rid of stack overflow errors, even on the PC, with the new features.
What do you think is the cause of the slow, or more accurately delayed, loading of the executable on the low-RAM machine? Do you think moving all of the big structures from the stack to the heap will fix it?
There are not a lot of things that take a minute in modern-day computing. Not on a machine with an embedded version of Windows either. Not the processor, not the RAM, not the disk.
Except one: networking is still based on assumptions that were last valid in the 1980s. TCP/IP has taken over as the only protocol in common use, but it has a flaw: there is no reasonable way to discover how long a connection attempt might take. So connection timeouts are based on absolute worst-case conditions, trying to hook up to a machine halfway around the world, connected by a modem that needs to spin up the drum to load the program.
The minimum timeout on Windows is hard-baked at 45 seconds. And a missing network connection is a condition that certainly isn't unlikely on an embedded machine: you might have hooked it up to a network to get it initialized, but it isn't connected anymore, or the machine you copied from might no longer be powered up.
Chase it down by first looking for a disconnected network drive, a very common cause. Next, use SysInternals utilities like TcpView to look for network activity, such as an attempt to connect to a CRL server. Use Process Explorer to find out where the program is stuck. Mark Russinovich's blog is excellent for showing the trouble-shooting strategies behind these tools. Good luck with it.

fcam - n900 - mysterious reboot

I wrote an application to take pictures, and well, it takes pictures, but it also randomly reboots the device.
How can I determine what caused it? Do I need to observe FCam events or can I just write a simple application that takes pictures?
Walter
There are a few causes of reboots, and hints on where to look, related to the camera on the N900/Maemo5:
- (huge) memory leaks, as mentioned above by Walter, may drain your swap and cause a reboot
- there is a HW watchdog which fires when some binary app messes heavily with pointers, array boundaries, etc. and hangs the CPU on itself (the process that periodically resets the HW watchdog then fails to reset it, and the HW watchdog pulls the power off)
- the DSP/ISP subsystem may still be less than perfect; coupled with its own DMA, it might cause interesting, sometimes entertaining, behavior
- xwindow/SGX can show interesting behavior while the camera is working
Now, remember this is still a Debian machine, only ARM instead of x86: enable R&D mode and get the syslog, which will give you some info to start the analysis.

Why can't I set master volume for USB/Firewire Audio interface with IAudioEndpointVolume::SetMasterVolumeLevelScalar

I am trying to fix an Audacity bug that revolves around PortMixer. The output/input level is settable using the Mac version of PortMixer, but not always on Windows. I am debugging PortMixer's Windows code to try to make it work there.
Using IAudioEndpointVolume::SetMasterVolumeLevelScalar to set the master volume works fine for onboard sound, but with pro external USB or FireWire interfaces like the RME Fireface 400, the output volume won't change, although the change is reflected in Windows' sound control panel for that device, and also in the system mixer.
Also, outside of our program, changing the master slider for the system mixer (in the taskbar) has no effect: the soundcard outputs the same (full) level regardless of the level the system says it is at. The only way to change the output level is to use the custom app that the hardware developers ship with the card.
The IAudioEndpointVolume::QueryHardwareSupport function reports ENDPOINT_HARDWARE_SUPPORT_VOLUME, so the device should be able to do this.
This behavior exists for both input and output on many devices.
Is this possibly a Windows bug?
It is possible to work around this by emulating (scaling) the output, but this is not preferred, as it is not functionally identical; it is better to let the audio interface do the scaling (especially for input, if it involves a preamp).
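For reference, the call path in question looks roughly like this (a minimal sketch without error handling; it grabs the default render endpoint directly rather than going through PortMixer's device selection):

#include <windows.h>
#include <mmdeviceapi.h>
#include <endpointvolume.h>

// Set the default render endpoint's master volume to 50%.
void SetMasterVolumeHalf()
{
    CoInitialize(nullptr);

    IMMDeviceEnumerator* pEnum = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&pEnum);

    IMMDevice* pDevice = nullptr;
    pEnum->GetDefaultAudioEndpoint(eRender, eConsole, &pDevice);

    IAudioEndpointVolume* pVolume = nullptr;
    pDevice->Activate(__uuidof(IAudioEndpointVolume), CLSCTX_ALL,
                      nullptr, (void**)&pVolume);

    DWORD mask = 0;
    pVolume->QueryHardwareSupport(&mask);   // reports ENDPOINT_HARDWARE_SUPPORT_VOLUME

    pVolume->SetMasterVolumeLevelScalar(0.5f, nullptr);  // no audible effect on the RME

    pVolume->Release();
    pDevice->Release();
    pEnum->Release();
    CoUninitialize();
}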
The cards you are talking about, like the RME ones, simply do not support setting the master level (or any other level) through software, and there is not much you can do about it. This is not a Windows bug. One could argue that reporting ENDPOINT_HARDWARE_SUPPORT_VOLUME is a bug, though, but that likely originates at the driver level, not in Windows itself.
The only solution I have found so far is hooking up a debugger (or adding a DLL hook) to the vendor-supplied software and looking at the DeviceIoControl calls it makes (those are the ones used to talk to the hardware) while setting the volume in the vendor software. It is pretty hard to do this for every single card, but probably worth doing for a couple of pro cards. Especially for Audacity: for open-source audio software it's actually not that bad, so I can imagine some people being really happy if it could set the volume on their card. (At the time, we were exclusively using an RME Multiface; I spent quite some time figuring out the DeviceIoControl calls, but in the end it was definitely worth it, as I could set the volume in dB for any point in the matrix.)

Device browsing problem

I’m writing file-browsing software, and I want it to work correctly with all portable devices, such as cameras, smartphones, and so on. My program shows thumbnails, so I need to read the content of each file.
Now I’m facing some problems:
With both my photo cameras I can open only one IStream from the device. For every additional stream I get an ERROR_BUSY error. This is inconvenient, as I generate thumbnails in several background threads.
I can open multiple streams from my smartphone, but I cannot seek in those streams! As a workaround I have to copy the entire stream to a temporary file-system location and process it there (sketched below).
I wonder what this depends on. The device file system? The driver implementation? Or anything else?
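The copy workaround looks roughly like this (a minimal sketch; CopyToTempFile is my name for it, and error handling is reduced to the essentials):

#include <windows.h>
#include <shlwapi.h>   // SHCreateStreamOnFileEx; link with shlwapi.lib
#pragma comment(lib, "shlwapi.lib")

// Copy the non-seekable device stream into a temp file, then hand back a
// seekable file stream for the thumbnail decoder.
HRESULT CopyToTempFile(IStream* pDeviceStream, const wchar_t* tempPath,
                       IStream** ppFileStream)
{
    HRESULT hr = SHCreateStreamOnFileEx(tempPath,
                                        STGM_READWRITE | STGM_CREATE,
                                        FILE_ATTRIBUTE_NORMAL, TRUE,
                                        nullptr, ppFileStream);
    if (FAILED(hr)) return hr;

    ULARGE_INTEGER cb;
    cb.QuadPart = ~0ULL;   // attempt to copy until the end of the stream
    hr = pDeviceStream->CopyTo(*ppFileStream, cb, nullptr, nullptr);

    // Rewind so the decoder reads from the start.
    LARGE_INTEGER zero = {};
    (*ppFileStream)->Seek(zero, STREAM_SEEK_SET, nullptr);
    return hr;
}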
Those seem like very reasonable restrictions on file access to a peripheral with very limited memory (limited fast volatile memory and code EEPROM are more of a concern than size of the flash card).
It's not the file system (which is almost universally FAT or FAT32 for these kinds of devices), nor a limitation of the Windows driver (although the limits are probably enforced there to avoid confusing the device), but rather the limited number of file descriptors in the device's embedded file-access code.
As a result, you'll probably need workarounds for these and other unsupported driver features.
On a related note, multiple threads usually aren't the right way to do background I/O operations. If your devices support OVERLAPPED operation, you can use that along with events and MsgWaitForMultipleObjects (which replaces PeekMessage or GetMessage in the classic GetMessage/TranslateMessage/DispatchMessage main event loop). By keeping everything on one thread, you avoid synchronization issues and most race conditions, and you prevent the following problem:
Your customer wants to select and use one of the files on her device, but oh no, the only IStream is being used on a thread reading thumbnails. Too bad, have to wait for that thread to finish its current file.
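A minimal sketch of that single-threaded pattern (assuming hFile was opened with FILE_FLAG_OVERLAPPED; the function and variable names are mine):

#include <windows.h>

// Kick off an overlapped read, then pump window messages while waiting
// on the I/O completion event, all on the main thread.
void PumpWithOverlappedRead(HANDLE hFile, char* buf, DWORD len)
{
    OVERLAPPED ov = {};
    ov.hEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);

    // Returns FALSE with GetLastError() == ERROR_IO_PENDING while in flight
    ReadFile(hFile, buf, len, nullptr, &ov);

    for (;;)
    {
        DWORD r = MsgWaitForMultipleObjects(1, &ov.hEvent, FALSE,
                                            INFINITE, QS_ALLINPUT);
        if (r == WAIT_OBJECT_0)          // the read completed
        {
            DWORD bytes = 0;
            GetOverlappedResult(hFile, &ov, &bytes, FALSE);
            break;
        }
        // r == WAIT_OBJECT_0 + 1: window messages arrived; keep the UI alive
        MSG msg;
        while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }
    CloseHandle(ov.hEvent);
}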