Updating QML from highly dynamic C++ data model: Timer vs Property binding

Assume a C++ business model which is potentially changing its underlying data very quickly (let's say up to 1000 Hz).
In my specific case, this would be a wrapper around a network data callback mechanism (a ROS-subscriber to be more precise) running in its own thread. Think of the data as some sensor reading. Further assume that we are dealing with simple data, e.g. a single double value or a string in the worst case.
What would be the best way to visualize such data in a QML component? "Best" here means being gentle with resources, minimizing the delay between the model state and the view, and keeping the code as clean as possible.
I came up with the following options:
1. Bind the data as a property of the model to the view, and emit a dataChanged() signal on every update of the data.
Pro: Elegant code; lowest resource usage in case of low data update rates.
Con: A high signal frequency will spam the GUI and maximize resource utilization, potentially rendering the system unresponsive.
2. Use a QML Timer and poll the model at a specific interval (e.g. 10 - 60 Hz).
Pro: Upper limit on resource utilization; steady rate.
Con: The timer itself consumes resources; not as elegant as data binding; if the data generation/callback rate is lower than the timer rate, unnecessary polling calls are made.
3. Use data binding as in 1., but throttle the signal emission frequency inside the C++ model using a QTimer (or another timer such as boost's), i.e. only emit the dataChanged() signal at specific intervals.
Pro: Same as 2.
Con: Same as 2.
4. Combine the timer from 3. with some kind of check whether the data actually changed before emitting the signal, thus avoiding unnecessary signal emissions when the data has not changed.
Pro: Same as 3.
Con: Same as 3., but even less elegant; increased risk of introducing a bug because the logic is more complicated; the model is harder to reuse because the "data changed" check depends heavily on the nature of the data.
Did I miss an option (besides not producing so much data in the first place)? What would be your choice, or the most Qt/QML way of realizing this?

First of all, have you established that there is a performance problem?
I mean granted, 1000 updates per second is plenty, but is this likely to force GUI updates 1000 times a second? Or maybe the GUI only updates when a new frame is rendered? Any remotely sane GUI framework would do exactly that. Even if your data changes a gazillion times per second, GUI changes will only be reflected at the frame rate the GUI is rendered at.
That being said, you could work to reduce the strain on the backend. A typical "convenience" implementation would burden the system, and even if any contemporary platform that Qt supports can handle 1000 Hz updates for an object or two, if there are many such objects you should work to reduce that. And even if it is not directly detrimental to system performance, efficiency is always nice as long as it doesn't come at too high a development cost.
I would not suggest directly updating the data through a property or model interface, as that would flood the GUI with change notifications. You should handle the underlying data updates silently at the low level, and only periodically inform the property or model interface of the changes.
If your update is continuous, just set a timer; 30 Hz or even less ought to be OK if a human is looking at it.
If it is not, you could use a "dirty" flag. When the data changes, set the flag and start the update timer as a single shot; when the timer triggers and notifies the update, it clears the flag and suspends. This way you avoid running the timer when no updates are necessary.
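For illustration, a minimal sketch of that dirty-flag approach, assuming a simple wrapper around the ROS callback (the SensorModel class, updateFromCallback() and the 33 ms interval below are made up): the callback thread only touches the value and the flag, and QML sees at most one notification per interval.

    // Sketch: coalesce up-to-1000 Hz updates into at most ~30 notifications/s.
    #include <QMetaObject>
    #include <QMutex>
    #include <QObject>
    #include <QTimer>
    #include <atomic>

    class SensorModel : public QObject
    {
        Q_OBJECT
        Q_PROPERTY(double value READ value NOTIFY valueChanged)
    public:
        explicit SensorModel(QObject *parent = nullptr) : QObject(parent)
        {
            m_throttle.setSingleShot(true);
            m_throttle.setInterval(33);                      // ~30 Hz towards the GUI
            connect(&m_throttle, &QTimer::timeout, this, &SensorModel::flush);
        }

        double value() const { QMutexLocker lock(&m_mutex); return m_value; }

        // Called from the ROS subscriber callback (any thread).
        void updateFromCallback(double v)
        {
            {
                QMutexLocker lock(&m_mutex);
                if (m_value == v)
                    return;                                  // nothing changed at all
                m_value = v;
            }
            // Only the first change after a flush arms the timer; further
            // changes just leave the dirty flag set and get coalesced.
            if (!m_dirty.exchange(true))
                QMetaObject::invokeMethod(this, "armTimer", Qt::QueuedConnection);
        }

    signals:
        void valueChanged();

    private slots:
        void armTimer() { m_throttle.start(); }              // runs in the model's thread
        void flush()
        {
            m_dirty.store(false);
            emit valueChanged();                             // QML re-reads value()
        }

    private:
        mutable QMutex m_mutex;
        double m_value = 0.0;
        std::atomic<bool> m_dirty{false};
        QTimer m_throttle;
    };

With something like this, the clean property binding from option 1 is kept on the QML side, while the throttling and the "did it actually change" check stay contained in the C++ model.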

Related

Multithreading in Direct 3D 12

Hi, I am a newbie learning Direct3D 12.
So far, I have understood that Direct3D 12 is designed for multithreading, and I'm trying to make my own simple multithreaded demo by following the tutorial by braynzarsoft:
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
The environment is Windows, using C++ and Visual Studio.
As far as I understand, multithreading in Direct3D 12 is, in a nutshell, populating command lists in multiple threads.
If that is right, it seems that
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
is enough for a single window program.
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does the GetCPUDescriptorHandleForHeapStart() return value change?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
The purpose of Q4 is that I thought of calling the function once and storing the value for reuse; it didn't change when I debugged.
The rendering loop in my mind is (based on the Game Loop pattern), for example:
1. A thread waits for the fence value (e.g. the main thread).
2. Start multiple threads to populate the command lists.
3. Wait until all threads are done populating.
4. ExecuteCommandLists.
5. Swap chain Present.
6. Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
Q2. Why do we need multiple fences?
All GPU work is asynchronous, so you can think of fences as low-level tools to achieve the same result as futures/coroutines. You can check if the work has been completed, wait for work to complete, or set an event on completion. You need a fence whenever you need to guarantee that a resource holds the output of work (when resource barriers are insufficient).
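As a rough illustration of the fence-as-a-future idea (a sketch only; the queue, fence, event handle and fence value are assumed to have been created during initialization):

    // Block the CPU until the GPU has finished everything submitted so far.
    #include <d3d12.h>
    #include <windows.h>

    extern ID3D12CommandQueue *g_queue;       // assumed created elsewhere
    extern ID3D12Fence        *g_fence;       // assumed created elsewhere
    extern HANDLE              g_fenceEvent;  // from CreateEvent()
    extern UINT64              g_fenceValue;  // monotonically increasing

    void WaitForGpu()
    {
        // Ask the queue to signal the fence once all prior work completes.
        const UINT64 valueToWaitFor = ++g_fenceValue;
        g_queue->Signal(g_fence, valueToWaitFor);

        // "Is the future ready yet?" If not, sleep until it is.
        if (g_fence->GetCompletedValue() < valueToWaitFor)
        {
            g_fence->SetEventOnCompletion(valueToWaitFor, g_fenceEvent);
            WaitForSingleObject(g_fenceEvent, INFINITE);
        }
    }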
Q4. Does GetCPUDescriptorHandleForHeapStart() return value changes?
No it doesn't.
store the value for reuse, it didn't change when I debugged.
The Direct3D 12 samples do this; you should know them intimately if you want to become proficient.
Rendering loop in my mind is (based on Game Loop pattern), for example,
That sounds okay, but I urge you to look at the Direct3D 12 samples and steal the patterns (and the code) they use there.
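For reference, the loop you describe boils down to roughly the following sketch (all names are placeholders, error handling and allocator resets are omitted, and WaitForPreviousFrame() would use the fence pattern from the previous snippet):

    #include <d3d12.h>
    #include <dxgi1_4.h>
    #include <thread>
    #include <vector>

    extern ID3D12CommandQueue *g_queue;
    extern IDXGISwapChain3    *g_swapChain;
    extern std::vector<ID3D12GraphicsCommandList *> g_lists;  // one per thread

    void RecordCommands(size_t threadIndex);   // resets, records, and Close()s list i
    void WaitForPreviousFrame();               // fence wait, see snippet above

    void RenderFrame()
    {
        // 1. Wait until the GPU has finished the previous frame.
        WaitForPreviousFrame();

        // 2./3. Populate the command lists in parallel and join.
        std::vector<std::thread> workers;
        for (size_t i = 0; i < g_lists.size(); ++i)
            workers.emplace_back(RecordCommands, i);
        for (std::thread &w : workers)
            w.join();

        // 4. Submit all recorded lists with a single call on the one queue.
        std::vector<ID3D12CommandList *> submit(g_lists.begin(), g_lists.end());
        g_queue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());

        // 5. Present the back buffer.
        g_swapChain->Present(1, 0);
    }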

Using timers with performance-critical software (Qt)

I am developing an application that is responsible for moving and managing robots over a UDP connection.
The application needs to:
Read joystick/user input using SDL.
Generate and send a control packet to the robot every 20 milliseconds (UDP)
Receive and decode response packets from the robot (~20 msecs). This was implemented with the signal/slot mechanism and does not require a timer.
Receive and process robot messages for debugging reasons. This is not time-regulated.
Update the UI regularly to keep the user notified about the status of the robot (e.g. battery voltage). For most cases, I have also used Qt's signal/slot mechanism.
Use a watchdog that disables the robot if no response is received after 1 second. The watchdog is reset when the application receives a robot packet (~20 msecs)
For the moment, I have implemented all of the above. However, the application fails to send the packets regularly when the watchdog is activated or when two or more QTimer objects are used. The application would generally work, but I would not consider it "production ready". I have tried to use the timers' precision flags (Qt::PreciseTimer, Qt::CoarseTimer and Qt::VeryCoarseTimer), but I still experienced problems.
Notes:
The code is generally well organized, there are no "god objects" in the code base (most source files are less than 150 lines long and only create the necessary dependencies).
Most of the time, I use QTimer::singleShot() (e.g. I will only send the next packet once the current packet has been sent).
Where we use timers:
To read joystick input (~50 msecs, precise timer)
To send robot packets (~20 msecs, precise timer)
To update some aspects of the UI (~500 msecs, coarse timer)
To update the elapsed time since the robot was enabled (~100 msecs, precise timer)
To implement a watchdog (put the application and robot in safe state if 1000 msecs have passed without a robot response)
Note: the watchdog is fed when we receive a response packet from the robot (~20 msecs)
Do you have any recommendations for using QTimer objects with performance-critical code? (Any idea is welcome.) Note that I have also tried to use different threads, but it caused me more problems, since the application would not be in "sync", thus failing to effectively control the robots that we have tested.
Actually, I seem to have underestimated Qt's timer and event loop performance. On my system I get on average around 20k nanoseconds for an event loop cycle, plus the overhead from scheduling a queued function call, and a timer with a 1 millisecond interval is rarely late; most of the timeouts are a few thousand nanoseconds short of a millisecond. But that is a high-end system; on embedded hardware it may be a lot worse.
You should take the time to profile your target system and Qt build to determine whether it can indeed run snappily enough, and based on those measurements, adjust your timings to compensate for the system delays so your events are scheduled more on time.
You should definitely keep the timer thread as free as possible, because if you block it with I/O or extensive computation, your timer will not be accurate. Use a dedicated thread to schedule work and extra worker threads to do the actual work. You may also try playing with thread priorities a bit.
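For example, something along these lines (a sketch only; buildAndSendControlPacket() is a stand-in for your packet code): the 20 ms timer lives in its own thread, and the timeout handler immediately hands the real work to the thread pool so the timer thread never blocks.

    #include <QCoreApplication>
    #include <QThread>
    #include <QTimer>
    #include <QtConcurrent>

    // void buildAndSendControlPacket();   // hypothetical, your UDP send code

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        QThread timerThread;                 // this thread does nothing but timing
        QTimer packetTimer;
        packetTimer.setInterval(20);
        packetTimer.setTimerType(Qt::PreciseTimer);
        packetTimer.moveToThread(&timerThread);

        QObject::connect(&packetTimer, &QTimer::timeout, [] {
            // Keep this handler tiny: push the real work to the thread pool
            // so the timer thread's event loop stays responsive.
            QtConcurrent::run([] { /* buildAndSendControlPacket(); */ });
        });

        // start() has to run in the timer's own thread, so trigger it from there.
        QObject::connect(&timerThread, &QThread::started,
                         &packetTimer, qOverload<>(&QTimer::start));
        timerThread.start();

        return app.exec();
    }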
Worst case scenario, look for third-party high-performance event loop implementations, or create your own, and potentially a faster signaling mechanism as well. As I already mentioned in the comments, Qt's inter-thread queued signals are very slow, at least compared to something like indirect function calls.
Last but not least, if you want to do task X every N units of time, it will only be possible if task X takes N units of time or less on your system. You need to make this consideration for each task, and for all tasks running concurrently. To get accurate scheduling, you should measure how long task X took and, if that is less than its period, schedule the next execution in the remaining time; otherwise execute immediately.
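A minimal sketch of that last point, with a hypothetical sendPacket() standing in for task X and a 20 ms period: measure how long the work took and arm the next single-shot timer with whatever is left of the period (or zero if the task overran).

    #include <QElapsedTimer>
    #include <QObject>
    #include <QTimer>

    class PacketScheduler : public QObject
    {
        Q_OBJECT
    public:
        void start() { tick(); }

    private slots:
        void tick()
        {
            QElapsedTimer clock;
            clock.start();

            sendPacket();                                   // the actual task X

            const qint64 elapsed = clock.elapsed();         // ms spent on the task
            const qint64 remaining = qMax<qint64>(0, 20 - elapsed);
            // If the task overran its 20 ms period, fire again immediately.
            QTimer::singleShot(int(remaining), Qt::PreciseTimer,
                               this, &PacketScheduler::tick);
        }

        void sendPacket() { /* build and send the UDP control packet */ }
    };

This keeps the long-term rate close to 50 Hz even when individual sends take a variable amount of time; it does not compensate for the event loop's own scheduling delay, for which you would have to schedule against absolute deadlines instead.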

How to deal with unresponsive UI in high-volume model update scenarios

We're using Qt 4.8.2, and we have a model/view design (subclassed QAbstractItemModel and QTreeView, specifically). The model/treeview follows the typical philosophy where the view drives the model: we don't populate the model until the user expands the corresponding treeview nodes.
Once a node has been expanded & data is visible, it is subject to display updates that occur in worker (non-UI) threads. Right now, when a worker thread produces a change that may affect the treeview, it emits a "change" signal, which maps to a slot in our model.
The problem is that these change signals can sometimes be emitted with great frequency (say, 1500 events over a second), but they may not apply to what the treeview currently displays (and can therefore be ignored). When this happens, the UI thread becomes unresponsive as (I presume) the signals all queue up and the UI thread has to deal with them before it can resume responding to user interaction.
The time it takes to respond to a change signal is very small, but it appears that the UI thread only "eats" the signals after a small delay - presumably to avoid excessive updating resulting in screen flicker or other annoyances.
The result is that the UI remains frozen for 5 or 6 seconds, with very low CPU activity during this time (presumably because the signals are coming in fast enough that the handler is still waiting for a break in the action); once all the signals are queued up, the thread finally consumes the work in the queue and resolves it in a few milliseconds.
I have several thoughts on this:
Is there some setting such that I can increase the aggressiveness by which the UI thread handles incoming signals?
Is it feasible at all to manage the updates of the model in a separate thread? My instinct is to say no - it would seem the Qt machinery is too dependent on exclusive ownership of the model, and putting the appropriate lock protection around its access would be complicated and violate the whole point of the slot/signal paradigm.
I can think up more complex schemes to deal with these signals in a secondary thread; for example, the UI could maintain a separate multithread-visible (non-model) data structure that could be queried to determine whether a change signal needed to be sent. Similarly, I could maintain a separate queue that the worker threads use, where I could batch up change events into a single signal (which I could deliver no more than twice a second, for example). But these methods strike me as a bit byzantine for a problem that I assume must not be totally uncommon in the domain of Qt UI programming.
We had a similar application with large updates to underlying data. The question comes down to:
1500 updates per second will result in how many changes in the GUI?
If the answer is that there will be fewer than 6 changes, then the model should emit at most 6 data changes per second. In that case, when the underlying data changes, check whether the change will affect the GUI or not, and only emit the dataChanged signal from the model when necessary.
If the answer is that there will be more than 6 GUI changes per second, then keep in mind that no human being can see more than 3 changes per second. The underlying data changes should not update the GUI directly at all. Use a 250 millisecond timer; in the timer event, check which cells need to be updated, and update them.
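A sketch of that second variant (a hypothetical model, shown with Qt 5 connect syntax for brevity; the same structure works with Qt 4.8's SIGNAL/SLOT macros): worker threads only mark rows as dirty under a mutex, and a 250 ms timer in the GUI thread drains the set and emits dataChanged() for those rows.

    #include <QAbstractTableModel>
    #include <QMutex>
    #include <QSet>
    #include <QTimer>

    class BatchedModel : public QAbstractTableModel
    {
        Q_OBJECT
    public:
        explicit BatchedModel(QObject *parent = nullptr) : QAbstractTableModel(parent)
        {
            connect(&m_refresh, &QTimer::timeout, this, &BatchedModel::flushDirtyRows);
            m_refresh.start(250);                    // at most 4 GUI refreshes per second
        }

        // Called from worker threads whenever the underlying data changes.
        void markRowDirty(int row)
        {
            QMutexLocker lock(&m_dirtyMutex);
            m_dirtyRows.insert(row);                 // repeated changes coalesce here
        }

        // Placeholder model plumbing; your real rowCount/columnCount/data go here.
        int rowCount(const QModelIndex & = QModelIndex()) const override { return m_rows; }
        int columnCount(const QModelIndex & = QModelIndex()) const override { return 1; }
        QVariant data(const QModelIndex &, int = Qt::DisplayRole) const override { return QVariant(); }

    private slots:
        void flushDirtyRows()
        {
            QSet<int> rows;
            {
                QMutexLocker lock(&m_dirtyMutex);
                rows.swap(m_dirtyRows);              // grab and clear the pending set
            }
            for (int row : rows)                     // one dataChanged() per dirty row
                emit dataChanged(index(row, 0), index(row, columnCount() - 1));
        }

    private:
        QMutex m_dirtyMutex;
        QSet<int> m_dirtyRows;
        QTimer m_refresh;
        int m_rows = 0;
    };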

Emitting a signal to one object in a set dynamically

I have a situation where I have a single Emitter object and a set of Receivers. The receivers are of the same class, and actually represent a set of devices of the same type. I'm using the Qt framework.
The Emitter itself first gets a signal asking for information from one of the devices.
In the corresponding slot, the Emitter has to check which of the Receivers are 'ready', and then send its own signal to request data from one of the devices (whichever is ready first).
The Emitter receives signals very quickly, on the order of milliseconds. There are three ways I can think of safely requesting data from only one of the devices (the devices live in their own threads, so I need a thread-safe mechanism). The number of devices isn't static, and can change. The total number of devices is quite small (definitely under 5-6).
1) Connect to all the devices when they are added or removed. Emit the one request and have the device objects themselves filter out whether the request is for them using some specific device tag. This method is nice because the request slot where the check occurs will execute in a dedicated thread's context, but it is wasteful as the number of devices grows.
2) Connect and disconnect from the object within the Emitter on the fly when it's necessary to send a request.
3) Use QMetaObject::invokeMethod() when it's necessary to send a request.
Performance is important. Does anyone know which method is the 'best', or if there's a better one altogether?
Regards
Pris
Note: To clarify: Emitter gets a signal from the application, to get info by querying the device. Crazy ASCII art go:
(app)<---->(emitter)<------>(receivers)<--|-->physical devices
Based on the information you have provided I would still recommend a Reactor implementation. If you don't use ACE then you can implement your own. The basic architecture is as follows:
Use select to wake up when a signal or data is received from the App.
If there is a socket ready on the sending list, then you just pick one and send it data.
When data is sent, the Receiver removes itself from the set of sockets/handlers that are available.
When the data is processed, the Receiver re-registers itself in the list of available recipients.
The reason I suggested ACE is because it has one of the simplest to use implementations of the Reactor pattern.
I'm assuming here that this is a multithreaded environment.
If you are restricted to Qt's signal/slot system, then the answers to your specific questions are:
1) is definitely not the way to go. On an emit from the Emitter, a number of events equal to the number of Receivers will be queued for the device thread(s)' event loops, and then the same number of slot calls will occur once the thread(s) reach those events. Even if most of the slots just do if (id != m_id) return; on their first line, that is a significant amount of work going on in the core of Qt. Place a breakpoint in one of your slots that is invoked by a Qt::QueuedConnection signal and validate this by looking at the actual stack trace. It's usually at least 4 calls deep from xyEventLoop::processEvents(...), so "just returning" is definitely not "free" in terms of time.
2) Not sure how Qt's inner implementation actually works, but from what I know, connecting and disconnecting most likely involves inserting and removing the sender and receiver into/from some lists, which are most likely accessed with QMutex locking. This might also be "expensive" time-wise, and rapidly connecting and disconnecting is definitely not a best practice.
3) Probably the least time-expensive solution you can find that still uses Qt's signal-slot system.
optionally) Take a look at QSignalMapper. It is designed exactly for what you planned to do in option 1).
There are more optimal solutions for communicating between your Emitter and Receivers, but as a best practice I'd first choose the option that is easiest to use and fastest to implement, yet has a chance of being fast enough at run time (that is option 3). Then, when it's done, see if it meets your performance requirements. If it does not, and only then, consider using shared memory with mutexes in a data provider / data consumer architecture (the Emitter thread rapidly posts request data into a circular list, the Receiver thread(s) read them whenever they have time and then post results back in a similar way, while the Emitter thread constantly polls for completed results).
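To illustrate option 3 (the Device class, its readiness flag and the requestData() slot below are hypothetical): the Emitter checks readiness in its own thread and queues the call into the chosen device's thread, with no connect/disconnect churn and no wasted wake-ups of the other devices.

    #include <QAtomicInt>
    #include <QMetaObject>
    #include <QObject>
    #include <QVector>

    class Device : public QObject
    {
        Q_OBJECT
    public:
        bool isReady() const { return m_ready.loadAcquire() != 0; }  // hypothetical readiness flag
    public slots:
        void requestData(int requestId)
        {
            Q_UNUSED(requestId);
            // talk to the physical device; runs in this device's own thread
        }
    private:
        QAtomicInt m_ready{1};
    };

    class Emitter : public QObject
    {
        Q_OBJECT
    public slots:
        void onInfoRequested(int requestId)
        {
            for (Device *device : m_devices) {
                if (!device->isReady())
                    continue;
                // Queued call: requestData() executes in the device's thread.
                QMetaObject::invokeMethod(device, "requestData",
                                          Qt::QueuedConnection,
                                          Q_ARG(int, requestId));
                break;                      // only the first ready device is asked
            }
        }
    private:
        QVector<Device *> m_devices;        // devices living in their own threads
    };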

With a single file descriptor, is there any performance difference between select, poll, epoll and ...?

The title really says it all.
The "and ..." means to also include pselect and ppoll.
The server project I'm working on is basically structured with multiple threads. Each thread handles one or more sessions. All the threads are identical. The protocol takes care of which thread will host the session.
I'm using an in-house socket class that wraps things up. The point of interest is a checkread call which calls either poll (Linux) or select (Windows).
In summary, each thread currently calls poll on a single socket. From what I can tell, using epoll would only be of benefit if this thread were looking at multiple sockets, such as what you'd get in, say, an HTTP server. That's not what I'm doing in my case, and the class only handles a single socket at a time.
There is some brief discussion about edge and level triggering in the man pages for epoll. I'm not really sure what it means. In the socket class I see an optimization in the Windows part of the code that shortcuts the select call with an ioctlsocket & FIONREAD check to see if there is any data. I'm wondering if that would return > 0 even if a complete UDP packet hadn't arrived at the time of the call. Is this what edge triggering is in epoll?
In some rudimentary testing, I'm also seeing no noticeable difference between using select and poll.
I can see that using ppoll might be of benefit though due to greater precision in the timeout. Any thoughts?
And yes, I am trying to optimize throughput for a session that is receiving lots of data. The server is more Network & Disk bound than CPU.
The main difference between epoll and select or poll is that epoll scales a lot better when run in a single thread. I don't know how this would compare to a multithreaded server using select or poll.
Look at this http://monkey.org/~provos/libevent/libevent-benchmark2.jpg
The reason for this (as far as I can tell) is that when you are using select or poll, you must loop through all the connected sockets to determine which ones have data to be read. When you are using epoll, it keeps a separate array which contains references only to sockets which have data to be read. This saves you lots of loop cycles, and the difference becomes more and more noticeable the more sockets are connected.
Another thing to look into if performance ever becomes a major issue is I/O completion ports (Windows only) and kqueue (FreeBSD only). It's also important to remember that epoll is Linux only. In most cases select or poll will work just fine.
In the case of a single file descriptor, select and poll are more efficient than epoll due to being much simpler (epoll has some overhead which is not useful with only a single socket).
According to the link: http://www.intelliproject.net/articles/showArticle/index/io_multiplexing.
If you use only one descriptor:
select: 201 microseconds.
poll: 159 microseconds.
epoll: 176 microseconds.
It seems poll would be the better solution in such a situation.
If you have only a single socket, what's the point of polling in the first place? Wouldn't the best performance then come from just using blocking read/write?
Wrt. the performance, with only a single file descriptor I don't think there is much, if any, difference between the various approaches. If you really care, I suppose you could measure, but I find it hard to believe that this would particularly matter for the overall performance of your program.
Level/edge triggering. Consider you're monitoring a signal, for simplicity say some voltage in a line. Edge triggering means that something triggers when the voltage goes over or under some specific limit. Level triggering means that something is considered to be in a triggered state as long as the voltage is over/under the limit. That is, edge triggering triggers when some event happens (crossing some threshold), level triggering reflects the state of some "thing" (in this case, voltage).
To get back to network programming, an edge-triggered system might be one where you get some kind of signal when a packet is received. If you don't handle the event, the signal is lost. A level-triggered system, OTOH, is something like asking "is there data waiting in the buffer for me?"; if you don't handle the event and ask again, the data will still be there waiting for you.
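To make that concrete for a socket (a sketch, assuming sock is a non-blocking UDP socket on Linux): the only difference in registration is the EPOLLET flag, but with edge triggering you must drain the socket until it would block, because you are only told about the edge, not about data that remains buffered.

    #include <sys/epoll.h>
    #include <unistd.h>

    void poll_one_socket(int sock, bool edge_triggered)
    {
        int ep = epoll_create1(0);

        epoll_event ev{};
        ev.events = EPOLLIN;
        if (edge_triggered)
            ev.events |= EPOLLET;            // edge-triggered instead of level-triggered
        ev.data.fd = sock;
        epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

        epoll_event ready{};
        if (epoll_wait(ep, &ready, 1, 1000) > 0) {       // 1 second timeout
            char buf[2048];
            if (edge_triggered) {
                // Edge-triggered: the readiness "edge" fired once, so read
                // until the socket would block (read() returns -1/EAGAIN).
                while (read(sock, buf, sizeof buf) > 0) { /* process datagram */ }
            } else {
                // Level-triggered: one read is enough; epoll_wait will keep
                // reporting the fd as long as data is still waiting.
                read(sock, buf, sizeof buf);
            }
        }
        close(ep);
    }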