I'm trying to understand the userland part of the Raspberry Pi graphics driver code from https://github.com/raspberrypi/userland
My understanding so far is:
- a firmware blob runs in the GPU and offers an OpenGL-like interface which, on lower levels, is based on message (byte-array) passing on top of one of multiple 28-bit-word FIFOs called VCHIQ (the other VCHIQ queues are irrelevant for graphics)
- on the CPU part, OpenGL calls are turned into messages to the GPU. Access to the low-level facility (either the message queue or VCHIQ -- I haven't found that part yet in the code) requires a Linux kernel module, but no high-level logic happens in there.
- the GPU part is closed, but that's okay for my purposes. The (ARM) CPU part is, AFAIK, open
My ultimate goal is to get communication with the GPU working on bare metal (without Linux), but with the closed firmware blob intact. As a first goal, I want to understand how an OpenGL call is actually passed to the GPU. Anything beyond that is not part of this question.
However, I'm stuck at finding the actual code for this. The OpenGL calls use RPC_CALL* and in turn RPC_DO, which calls khronos_server_lock_func_table(). However, that function seems to be missing from the code, and to my surprise, I couldn't find anything useful about it on Google.
My questions:
- am I still on the ARM CPU side, or did I move to GPU land without noticing? If the latter is the case, where did I cross that line?
- Assuming I'm still on the CPU side -- where is the code for that function? Is it open at all, or do we actually have closed parts left around on the CPU side here? All sources on the web seem to indicate that the code for the CPU is 100% open.
- at which point does the implementation of the C OpenGL functions actually send a message to the GPU? I'm somewhat expecting a call to the kernel functionality that represents VCHIQ to be happening at some point, probably implemented as a device file.
I don't fully understand how you intend to access the GPU without using Linux, and I am not that familiar with the technicalities, but some time ago I was digging into the GPU for a private project, so I'll tell you what I know.
The GPU is VideoCore IV and its documentation is available on Broadcom's website.
Also, the Raspberry Pi Wiki has a diagram (on the left of the page) showing that VCHIQ sits in the kernel driver, so you might look for the implementation details in the kernel's source code.
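From what I can tell, the boundary you're asking about is the /dev/vchiq device node that this kernel driver exposes; the userland repo wraps it in interface/vchiq_arm/. Purely as an illustration (the "KHRN" service FourCC and the message payload here are my assumptions, and I may have details of the API wrong), talking to a VCHIQ service looks roughly like this:

#include "interface/vchiq_arm/vchiq_if.h"

int main(void)
{
    VCHIQ_INSTANCE_T instance;
    if (vchiq_initialise(&instance) != VCHIQ_SUCCESS)   // opens /dev/vchiq internally
        return 1;
    vchiq_connect(instance);                            // handshake with the VideoCore side

    VCHIQ_SERVICE_PARAMS_T params = {0};                // a real client also sets params.callback
    params.fourcc = VCHIQ_MAKE_FOURCC('K', 'H', 'R', 'N'); // assumed service name
    VCHIQ_SERVICE_HANDLE_T service;
    vchiq_open_service(instance, &params, &service);

    const char payload[] = "serialized RPC message";    // placeholder payload
    VCHIQ_ELEMENT_T element = { payload, sizeof(payload) };
    vchiq_queue_message(service, &element, 1);          // this is where bytes cross to the GPU
    return 0;
}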
This might be of some help too: the VideoCore IV Programmer's Manual. About the document:
This is an independent documentation project based on a combination of static analysis and trial and error on real hardware. This work is 100% independent from and not sanctioned by or connected with Broadcom or its agents. No Broadcom documents or materials were used beyond those publicly available.
As for the software itself, the Khronos Group provides the OpenGL ES and OpenVG implementations, but they're not open source. You can get the documentation from their website, but I doubt you'll find anything at such a low level.
Hope it helps.
Following up on this thread, I reimplemented my image processing code to send in 10 times as many images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times longer than before, which means that I did not get any performance increase.
Is that expected, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.
Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Though it seems relatively simple to implement based on the current implementation of GPU parallelism, at the moment there's no similar support for parallel processing on multiple CPUs.
The main problem with implementing parallelism in Caffe is the syncing you need during training. If you just want to process your images in parallel (as opposed to training the model), you can load several copies of the same network into memory (whether through Python with multiprocessing or C++ with multi-threading) and process each image on a different network; a sketch of the C++ variant follows below. This is simple and quite effective, especially if you load the networks once and then process a large number of images. Nevertheless, GPUs are much faster :)
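A minimal sketch of the C++ variant, assuming a reasonably recent Caffe build (the Net(param_file, phase) constructor); the file names, thread count, and the input-filling step are placeholders:

#include <caffe/caffe.hpp>
#include <thread>
#include <vector>

static void worker(int id) {
    caffe::Caffe::set_mode(caffe::Caffe::CPU);       // the mode is thread-local in Caffe
    caffe::Net<float> net("deploy.prototxt", caffe::TEST);
    net.CopyTrainedLayersFrom("model.caffemodel");   // every thread loads its own copy
    // ... fill the input blob with worker id's share of the images,
    //     then call net.Forward() and read the output blob ...
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)                      // e.g. one worker per core
        pool.emplace_back(worker, i);
    for (std::thread& t : pool) t.join();
}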
Caffe doesn't process multiple images in parallel. The only saving you get by batch processing several images is in the time it takes to transfer the image data into and out of Caffe's framework, which can be significant when dealing with the GPU.
IIRC there have been several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (cuDNN, CUDA streams, etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithreaded mode, or simply running on multiple threads). Of those, I believe only the cuDNN option is currently part of the stable version of Caffe, but it obviously requires a GPU. You can try looking at one of the pull requests on this matter on Caffe's GitHub page and see if it works for you, but note that it might cause compatibility issues with your current version.
Here's one such version that I've used in the past, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of that pull request that there's some speed-up to the CPU code in this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610
I've been working on a face-tracking system for the last couple of months and now I need to make everything run in parallel to increase performance.
The main cpp file is:
int _tmain(int argc, _TCHAR* argv[])
{
    cFrame.initCamFrames(20, 1600, 1200, 3);            // initialise buffer for 20 camera frames, 1600x1200, 3 bytes per pixel
    eyeTracking.initTrackingSystem(&cFrame);            // eye tracking (OpenCV) searches for faces within that buffer
    directShow directShowClass;
    directShowClass.initiateDirectShow(false, &cFrame); // DirectShow saves captured frames into the same buffer
    directShowClass.runDirectShow();                    // start capturing frames into the buffer
    eyeTracking.runTrackingSystem();                    // start searching for face and eyes
    system("pause");
    directShowClass.stopDirectShow();
    return 0;
}
I want "directShowClass.runDirectShow();" and "eyeTracking.runTrackingSystem();" to run in real parallel. now I think that they run as threads in pseudo-parallel. (simple printf in each method occur mixed up in the terminal).
I guess that making a program run in parallel is not as simple as I would like it to be. But I guess it is possible :D
Please give me some advice on where to start searching for information about how to parallelize.
I have a dual core processor.
Thanks!
printf is not thread-safe, i.e. it can mix up output buffers like you encountered. You can run the process in pseudo-parallel (e.g. switch each call to another processing step) or with real hardware concurrency (std::thread, pthreads, Windows threads, boost::thread).
Since you have a dual-core processor you can certainly take advantage of multi-core processing; I would suggest using Boost. A sketch of what that could look like for your two calls follows.
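For example (a sketch only, using the names from your main; EyeTracking is a stand-in for your real class name, and it assumes both methods are safe to run concurrently, which with a shared frame buffer you must verify):

#include <boost/thread.hpp>

boost::thread captureThread(&directShow::runDirectShow, &directShowClass);   // capture frames
boost::thread trackingThread(&EyeTracking::runTrackingSystem, &eyeTracking); // search for face/eyes

captureThread.join();   // wait for both to finish before stopping DirectShow
trackingThread.join();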
Just to clear things up: by using threads you do get real parallelism. But remember that your computer is also running other processes in the background on its cores, and those take up CPU time, so your functions are not always being executed.
In order to get some parallelism in C++ you have many options. I name three:
- The oldest and most common way is to use the pthreads library, which is available with almost every compiler.
- The new C++ standard, C++11, includes native multi-threading support (std::thread and friends). You can check that out, but it is still not supported by every compiler, and most compilers that do support it offer only partial functionality. You also need to enable the standard explicitly. For example:
gcc -std=c++11
- Finally, if you are in the mood for some "higher level" stuff, you can put some effort into learning about the OpenMP framework, which uses pragma directives to annotate parallel tasks. The framework then deals with all the thread creation, so you can spend your time on other things.
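For instance, the two calls from the question could run on separate threads like this (again assuming they are safe to run concurrently; compile with -fopenmp on GCC or /openmp on MSVC):

#pragma omp parallel sections
{
    #pragma omp section
    { directShowClass.runDirectShow(); }     // capture frames into the buffer

    #pragma omp section
    { eyeTracking.runTrackingSystem(); }     // search for face and eyes
}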
P.S.: The reason the output comes out mixed is not that the threads run in pseudo-parallel, but that they are concurrently writing to the same buffer. So when the buffer is dumped you see it as the threads wrote it. If anything, this is proof that they are actually running in parallel - you are just making them write their output to the same buffer ;)
I am creating a basic signal generator and decided to use my audio card as the analogue output. I chose to use DirectSound because... it seemed like a good option.
I have it up and running quite nicely, but I now realize that my code uses secondary buffers, and as such any other sounds on the computer get mixed into my generated signal. This is something of an issue: when I'm driving a motor I don't want an MSN poke noise sent to it as a command.
In order to gain total control I've attempted to take over the system by setting my cooperative level to DSSCL_WRITEPRIMARY. All in all this strategy is giving me a real headache, as I am running into error after error trying to get it set up. The documentation on using the primary buffer isn't great and I can't find any really good examples.
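For reference, the rough sequence I'm attempting looks like this (simplified, with error handling and the actual Lock/write/Unlock loop left out):

#include <dsound.h>

HRESULT takeOverPrimary(HWND hwnd, LPDIRECTSOUND8 &ds, LPDIRECTSOUNDBUFFER &primary)
{
    DirectSoundCreate8(NULL, &ds, NULL);
    ds->SetCooperativeLevel(hwnd, DSSCL_WRITEPRIMARY);  // full control; must precede buffer creation

    DSBUFFERDESC desc = { sizeof(DSBUFFERDESC) };
    desc.dwFlags = DSBCAPS_PRIMARYBUFFER;               // no size or format here; the primary buffer is special
    ds->CreateSoundBuffer(&desc, &primary, NULL);

    WAVEFORMATEX wfx = { WAVE_FORMAT_PCM, 1, 44100, 88200, 2, 16, 0 }; // 16-bit mono 44.1 kHz
    primary->SetFormat(&wfx);

    // ...then Lock/write samples/Unlock and Play(0, 0, DSBPLAY_LOOPING)...
    return S_OK;
}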
So my questions are:
- Does anyone have a good, working example of taking over and writing to the primary buffer?
- Is there a simpler way of outputting a waveform to the audio card while ensuring that my application has full and sole control?
Thank you
The only related thing I've seen is:
http://blogs.msdn.com/b/matthew_van_eerde/archive/2009/04/03/sample-wasapi-exclusive-mode-event-driven-playback-app-including-the-hd-audio-alignment-dance.aspx
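Stripped to its core, the exclusive-mode setup that sample develops looks roughly like this (a sketch only; error handling, CoInitializeEx, format negotiation and the render loop are all omitted, and the linked post covers the buffer-alignment subtleties):

#include <mmdeviceapi.h>
#include <audioclient.h>

IAudioClient *openExclusive(WAVEFORMATEX *wfx, REFERENCE_TIME period)
{
    IMMDeviceEnumerator *devEnum = NULL;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void **)&devEnum);

    IMMDevice *device = NULL;
    devEnum->GetDefaultAudioEndpoint(eRender, eConsole, &device); // default output device

    IAudioClient *client = NULL;
    device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, NULL, (void **)&client);

    // Exclusive mode: nothing else on the system is mixed into the stream.
    // In event-driven exclusive mode the buffer duration must equal the period.
    client->Initialize(AUDCLNT_SHAREMODE_EXCLUSIVE, AUDCLNT_STREAMFLAGS_EVENTCALLBACK,
                       period, period, wfx, NULL);
    return client;
}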
I'm considering writing a new Windows GUI app, where one of the requirements is that the app must be very responsive, quick to load, and have a light memory footprint.
I've used WTL for previous apps I've built with this type of requirement, but as I use .NET all the time in my day job, WTL is getting more and more painful to go back to. I'm not interested in using .NET for this app, as I still find the performance of larger .NET UIs lacking, but I am interested in using a better C++ framework for the UI - like Qt.
What I want to be sure of before starting is that I'm not going to regret this on the performance front.
So: Is Qt fast?
I'll try to qualify the question with examples of what I'd like to come close to matching: my current WTL app is Programmer's Notepad. The current version I'm working on weighs in at about 4 MB of code for a 32-bit, release-compiled version with a single language translation. On a modern fast PC it takes 1-3 seconds to load, which is important as people fire it up often to avoid IDEs etc. The memory footprint is usually 12-20 MB on 64-bit Win7 once you've been editing for a while. You can run the app non-stop, leave it minimized, whatever, and it always jumps to attention instantly when you switch to it.
For the sake of argument let's say I want to port my WTL app to Qt for potential future cross-platform support and/or the much easier UI framework. I want to come close to if not match this level of performance with Qt.
Just chiming in with my experience in case you still haven't solved it or anyone else is looking for more experience. I've recently developed a pretty heavy (regular QGraphicsView, OpenGL QGraphicsView, QtSQL database access, ...) application with Qt 4.7 AND I'm also a stickler for performance. That includes startup performance of course, I like my applications to show up nearly instantly, so I spend quite a bit of time on that.
Speed: Fantastic, I have no complaints. My heavy app, which needs to instantiate at least 100 widgets on startup alone (granted, a lot of those are QLabels), starts up in a split second (I don't notice any delay between double-clicking and the window appearing).
Memory: This is the bad part; in my experience Qt with many subsystems does use a noticeable amount of memory. Then again, that figure does cover usage of many subsystems: QtXml, QtOpenGL, QtSql, QtSvg, you name it, I use it. My current application manages to use about 50 MB at startup, but it starts up lightning fast and responds swiftly as well.
Ease of programming / API: Qt is an absolute joy to use, from its containers to its widget classes to its modules, all while making memory management easy (the QObject parent-child system) and maintaining superb performance. I had always written pure Win32 before this and I will never go back. For example, with the QtConcurrent classes I was able to change a method invocation from myMethod(arguments) to QtConcurrent::run(this, &MyClass::myMethod, arguments), and with that single line a heavy non-GUI processing method was threaded. With a QFuture and QFutureWatcher I could monitor when the thread had ended (either with signals or by checking the future). What ease of use! Very elegant design all around.
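As a concrete illustration of that pattern (Qt 4 syntax; MyClass::myMethod and the onProcessingDone() slot are stand-ins for your own names):

#include <QtConcurrentRun>
#include <QFutureWatcher>

QFuture<void> future = QtConcurrent::run(this, &MyClass::myMethod, arguments);

QFutureWatcher<void> *watcher = new QFutureWatcher<void>(this);
connect(watcher, SIGNAL(finished()), this, SLOT(onProcessingDone()));
watcher->setFuture(future);  // onProcessingDone() runs in the GUI thread when the work ends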
So in retrospect: very good performance (including app startup), quite high memory usage if many submodules are used, a fantastic API and possibilities, and cross-platform support.
Going native API is the most performant choice by definition - anything other than that is a wrapper around native API.
What exactly do you expect to be the performance bottleneck? Any hard numbers? Honestly, a vague "very responsive, quick to load, and have a light memory footprint" sounds like a requirements-gathering bug to me. Performance is often overspecified.
To the point:
- Qt's signal-slot mechanism is really fast. It's statically typed and MOC translates it into quite simple slot method calls.
- Qt offers nice multithreading support, so you can have a responsive GUI in one thread and whatever else in other threads without much hassle. That might work; a sketch of the usual pattern follows below.
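A sketch of what that typically looks like (the standard worker-object pattern; Worker, doWork() and done() are hypothetical names):

QThread *thread = new QThread;
Worker *worker = new Worker;          // a plain QObject subclass with a doWork() slot
worker->moveToThread(thread);

QObject::connect(thread, SIGNAL(started()), worker, SLOT(doWork()));
QObject::connect(worker, SIGNAL(done()), thread, SLOT(quit()));
QObject::connect(thread, SIGNAL(finished()), worker, SLOT(deleteLater()));
QObject::connect(thread, SIGNAL(finished()), thread, SLOT(deleteLater()));
thread->start();                      // doWork() now runs off the GUI thread; signals marshal results back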
Programmer's Notepad is a text editor which uses Scintilla as its text-editing core component and WTL as its UI library.
JuffEd is a text editor which uses QScintilla as its text-editing core component and Qt as its UI library.
I have installed the latest versions of Programmer's Notepad and JuffEd and studied the memory footprint of both editors by using Process Explorer.
Empty file:
- juffed.exe Private Bytes: 4,532K Virtual Size: 56,288K
- pn.exe Private Bytes: 6,316K Virtual Size: 57,268K
"wtl\Include\atlctrls.h" (264K, ~10.000 lines, scrolled from beginning to end a few times):
- juffed.exe Private Bytes: 7,964K Virtual Size: 62,640K
- pn.exe Private Bytes: 7,480K Virtual Size: 63,180K
After a select all (Ctrl-A), cut (Ctrl-X) and paste (Ctrl-V):
- juffed.exe Private Bytes: 8,488K Virtual Size: 66,700K
- pn.exe Private Bytes: 8,580K Virtual Size: 63,712K
Note that while scrolling (Page Down/Page Up held down) JuffEd seemed to eat more CPU than Programmer's Notepad.
Combined exe and dll sizes:
- juffed.exe QtXml4.dll QtGui4.dll QtCore4.dll qscintilla2.dll mingwm10.dll libjuff.dll: 14 MB
- pn.exe SciLexer.dll msvcr80.dll msvcp80.dll msvcm80.dll libexpat.dll ctagsnavigator.dll pnse.dll: 4.77 MB
The above comparison is not fair because JuffEd was not compiled with Visual Studio 2005, which should generate smaller binaries.
We have been using Qt for multiple years now, developing a good-sized UI application with various elements in the UI, including a 3D window. Whenever we hit a major slowdown in app performance it is usually our fault (we do a lot of database access) and not the UI's.
They have done a lot of work over the last few years to speed up drawing (which is where most of the time is spent). In general, unless you implement some kind of editor, not a lot of time is spent executing code inside the UI; it mostly waits for input from the user.
Qt is a very nice framework, but there is a performance penalty. This has mostly to do with painting. Qt uses its own renderer for painting everything - text, rectangles, you name it... To the underlying window system every Qt application looks like a single window with a big bitmap inside. No nested windows, no nothing. This is good for flicker-free rendering and maximum control over the painting, but it comes at the price of completely forgoing any possibility of hardware acceleration. Hardware acceleration is still noticeable nowadays, e.g. when filling large rectangles in a single color, as is often the case in windowing systems.
That said, Qt is "fast enough" in almost all cases.
I mostly notice slowness when running on a MacBook whose CPU fan is very sensitive and comes to life after only a few seconds of moderate CPU activity. Using the mouse to scroll around in a Qt application loads the CPU a lot more than scrolling around in a native application. The same goes for resizing windows.
As I said, Qt is fast enough but if increased battery draining matters to you, or if you care about very smooth window resizing, then you don't have much choice besides going native.
Since you seem to consider a 3-second application startup "fast", it doesn't sound like you would care much about Qt's performance, though. I would consider a 3-second startup dog-slow, but opinions on that naturally vary.
The overall program performance will of course be up to you, but I don't think that you have to worry about the UI. Thanks to the graphics scene and OpenGL support you can do fast 2D/3D graphics too.
Last but not least, an example from my own experience:
Using Qt on Linux/embedded and a Windows XP machine with 128 MB of RAM: the Windows build uses MFC, the Linux build uses Qt. A custom user GUI with lots of painting, plus a regular admin GUI with controls/widgets. From a user's point of view, Qt is as fast as MFC. Note: it was a full-screen program that could not be minimized.
Edited after you have added more info:
You can expect a larger executable size (especially with Qt under MinGW) and more memory usage. In your case, try playing with one of the IDEs (e.g. Qt Creator) or text editors written in Qt and see what you think.
I personally would choose Qt as I've never seen any performance hit for using it. That said, you can get a little closer to native with wxWidgets and still have a cross-platform app. You'll never be quite as fast as straight Win32 or MFC (and family) but you gain a multi-platform audience. So the question for you is, is this worth a small trade-off?
My experience is mostly with MFC, and more recently with C#. MFC is pretty close to the bare metal, so unless you define a ton of data structures, it should be pretty quick.
For graphics painting, I always find it useful to render to a memory bitmap and then blt that to the screen. It looks faster, and it may even be faster, because it isn't worrying about clipping.
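A minimal sketch of that pattern inside a WM_PAINT handler (in a real app you would cache the memory bitmap between paints instead of recreating it every time):

case WM_PAINT: {
    PAINTSTRUCT ps;
    HDC screen = BeginPaint(hwnd, &ps);
    RECT rc;
    GetClientRect(hwnd, &rc);

    HDC mem = CreateCompatibleDC(screen);
    HBITMAP bmp = CreateCompatibleBitmap(screen, rc.right, rc.bottom);
    HBITMAP old = (HBITMAP)SelectObject(mem, bmp);

    // ...draw the whole scene into mem here...

    BitBlt(screen, 0, 0, rc.right, rc.bottom, mem, 0, 0, SRCCOPY); // one flicker-free blit

    SelectObject(mem, old);
    DeleteObject(bmp);
    DeleteDC(mem);
    EndPaint(hwnd, &ps);
    break;
}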
There usually is some kind of performance problem that creeps in, in spite of my trying to avoid it. I use a very simple way to find these problems: just wait until it's being subjectively slow, pause it, and examine the call stack. I do this a number of times - 10 is usually more than enough. It's a poor man's profiler but works well, no fuss, no bother. The problem is always something no one could have guessed, and usually easy to fix. This is why it works.
If there are dialogs of any complexity, I use my own technique, Dynamic Dialogs, because I'm spoiled. They are not for the faint-of-heart, but are very flexible and perform nicely.
I once made an app to determine the "primeness" of a number (whether it was prime or composite).
I first attempted a Qt GUI, and it took 5 hours to return the answer for 1,299,827 on a computer with 8 GB of RAM and an AMD 1090T @ 4 GHz running no other foreground processes under Linux.
My second attempt used a QProcess of a console application that used the exact same code. On a laptop with 1.3 GB of RAM and a 1.4 GHz CPU, the response came with no perceivable delay.
I will not deny, though, that Qt is far easier than GTK+ or Win32, and it handles things quite nicely - but keep intensive processing ENTIRELY separate from the GUI if you use it.
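The QProcess version is about as simple as this sketch (the isprime console helper and the slot name are hypothetical):

QProcess *proc = new QProcess(this);
connect(proc, SIGNAL(finished(int,QProcess::ExitStatus)),
        this, SLOT(onPrimeCheckDone(int)));              // stand-in slot for handling the result
proc->start("isprime", QStringList() << QString::number(1299827));
// the GUI stays responsive; read proc->readAllStandardOutput() in the slot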