I am trying to make some design decisions for an algorithm I am working on. I think that I want to use signals and slots to implement an observer pattern, but I am not sure of a few things.
Here is the algorithm I am working towards:
1.) Load tiles of an image from a large file
1a.) Copy the entire file to a new location
2.) Process the tiles as they are loaded
3.) If the copy has been created, copy the resulting data into the new file
So I envision having a class with a function like loadAllTiles(), which would emit a signal to tell processTile() that another tile was ready to be processed, while it moves on to load the next tile.
processTile() would perform some calculations, and when complete, signal to writeResults() that a new set of results data was ready to be written. writeResults() would verify that the copying was complete, and start writing the output data.
Does this sound reasonable? Is there a way to make loadAllTiles() load in a tile, pass that data somehow to processTile(), and then keep going and load the next tile? I was thinking about maybe setting up a list of some sort to store the tiles ready to be processed, and another list for result tiles ready to be written to disk. I guess the disadvantage there is that I have to somehow keep those lists intact, so that multiple threads aren't trying to add/remove items from them at the same time.
Thanks for any insight.
It's not completely clear from your question, but it seems that you want to split the work across several threads, so that the processing of tiles can begin before you finish loading the entire set.
Consider a multithreaded processing pipeline architecture. Assign one thread per task (loading, copying, processing), and pass around tiles between tasks via Producer-Consumer queues (aka BlockingQueue). To be more precise, pass around pointers (or shared pointers) to tiles to avoid needless copying.
There doesn't seem to be a ready-made thread-safe BlockingQueue class in Qt, but you can roll your own using QQueue, QWaitCondition, and QMutex. Here are some sources of inspiration:
Just Software Solutions' blog article.
Java's BlockingQueue
ZThread's BlockingQueue
While there isn't a ready-made BlockingQueue within Qt, it seems that using signals & slots with the Qt::QueuedConnection option may serve the same purpose. This Qt blog article makes such use of signals and slots.
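If you do roll your own, a minimal sketch built from QQueue, QMutex, and QWaitCondition might look like this (the class name and the put()/take() interface are my own choices, not something Qt provides):

// A minimal, unbounded producer-consumer queue sketch.
// Not part of Qt; just one way to combine QQueue, QMutex and QWaitCondition.
#include <QQueue>
#include <QMutex>
#include <QWaitCondition>

template <typename T>
class BlockingQueue
{
public:
    void put(const T &item)
    {
        QMutexLocker locker(&mutex_);
        queue_.enqueue(item);
        notEmpty_.wakeOne();          // wake one waiting consumer
    }

    T take()
    {
        QMutexLocker locker(&mutex_);
        while (queue_.isEmpty())      // guard against spurious wakeups
            notEmpty_.wait(&mutex_);
        return queue_.dequeue();
    }

private:
    QQueue<T> queue_;
    QMutex mutex_;
    QWaitCondition notEmpty_;
};

The TileLoader thread would put() pointers to freshly loaded tiles, the TileProcessor thread would block in take() until one arrives, and a second queue of the same kind sits between processing and writing.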
You may want to combine this pipeline approach with a memory pool or free list of tiles, so that already allocated tiles are recycled in your pipeline.
Here's a conceptual sketch of the pipeline:
TilePool -> TileLoader -> PCQ -> TileProcessor -> PCQ -> TileSaver -\
   ^                                                                 |
   \-----------------------------------------------------------------/
where PCQ represents a Producer-Consumer queue.
To exploit even more parallelism, you can try thread pools at each stage.
You can also consider checking out Intel's Threading Building Blocks. I haven't tried it myself. Be aware of the GPL licence for the open source version.
Keeping the lists from corruption should be possible with any of the usual locking mechanisms: plain mutexes, semaphores, and so on.
Otherwise, the approach sounds reasonable, though I would say the files would have to be large for this to make sense. As long as they easily fit into memory, I don't see the point of loading them piecewise. Also: how do you plan on extracting tiles without reading the entire image repeatedly?
Hi, I am a newbie learning Direct3D 12.
So far, I have understood that Direct3D 12 is designed for multithreading, and I'm trying to make my own simple multithreaded demo by following the tutorial by braynzarsoft.
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
Environment is windows, using C++, Visual Studio.
As far as I understand, multithreading in Direct3D 12 comes down, in a nutshell, to populating command lists in multiple threads.
If it is right, it seems
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
is enough for a single window program.
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does the value returned by GetCPUDescriptorHandleForHeapStart() ever change?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
The purpose of Q4 is that I thought of calling the function once and storing the value for reuse; it didn't change when I debugged.
Rendering loop in my mind is (based on Game Loop pattern), for example,
1. A thread waits for the fence value (e.g. the main thread).
2. Begin multiple threads to populate command lists.
3. Wait until all threads are done with population.
4. ExecuteCommandLists.
5. Swap chain Present.
6. Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
Q2. Why do we need multiple fences?
All GPU work is asynchronous, so you can think of fences as low-level tools to achieve the same result as futures/coroutines. You can check whether the work has been completed, wait for work to complete, or set an event on completion. You need a fence whenever you need to guarantee that a resource holds the output of work (when resource barriers are insufficient).
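For example, the usual "wait until the GPU has reached this fence value" step looks roughly like this sketch (error checks omitted; the fence/event/value trio is the same one the official samples keep as members):

#include <d3d12.h>

// Signal the fence on the queue, then block this CPU thread until the GPU reaches it.
void WaitForGpu(ID3D12CommandQueue* queue, ID3D12Fence* fence,
                HANDLE fenceEvent, UINT64& fenceValue)
{
    const UINT64 fenceToWaitFor = ++fenceValue;
    queue->Signal(fence, fenceToWaitFor);              // GPU sets the fence when it gets here

    if (fence->GetCompletedValue() < fenceToWaitFor)   // has the GPU already passed it?
    {
        fence->SetEventOnCompletion(fenceToWaitFor, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);     // only this CPU thread blocks
    }
}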
Q4. Does the value returned by GetCPUDescriptorHandleForHeapStart() ever change?
No it doesn't.
store the value for reuse, it didn't change when I debugged.
The Direct3D 12 samples do this; you should know them intimately if you want to become proficient.
Rendering loop in my mind is (based on Game Loop pattern), for example,
That sounds okay, but I urge you to look at the direct3d12 samples and steal the patterns (and the code) they use there.
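As a rough illustration of that frame shape (a sketch only, with placeholder member names and a hypothetical RecordCommandListsOnWorkerThreads() helper):

// One frame, single queue, N command lists recorded by N threads (error checks omitted).
void RenderFrame()
{
    // 1. Wait on the fence signalled after the previous frame's work
    //    (see the WaitForGpu() sketch above).
    WaitForGpu(m_commandQueue, m_fence, m_fenceEvent, m_fenceValue);

    // 2./3. Let the N worker threads reset their own allocator + list and record commands,
    //       then join them here (hypothetical helper).
    RecordCommandListsOnWorkerThreads();

    // 4. One submission containing all N lists.
    m_commandQueue->ExecuteCommandLists(
        static_cast<UINT>(m_commandLists.size()), m_commandLists.data());

    // 5. Present, then start the next loop iteration.
    m_swapChain->Present(1, 0);
}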
I can't seem to find a good answer to this:
I'm making a game, and I want the logic loop to be separate from the graphics loop. In other words I want the game to go through a loop every X milliseconds regardless of how many frames/second it is displaying.
Obviously they will both be sharing a lot of variables, so I can't have a thread/timer passing one variable back and forth... I'm basically just looking for a way to have a timer in the background that every X milliseconds sends out a flag to execute the logic loop, regardless of where the graphics loop is.
I'm open to any suggestions. It seems like the best option is to have 2 threads, but I'm not sure what the best way to communicate between them is, without constantly synchronizing large amounts of data.
You can very well do multithreading by having your "world view" exchanged every tick. So here is how it works:
1. Your current world view is pointed to by a single smart pointer and is read-only, so no locking is necessary.
2. Your logic creates your (first) world view, publishes it, and schedules the renderer.
3. Your renderer grabs a copy of the pointer to your world view and renders it (remember, read-only).
4. In the meantime, your logic creates a new, slightly different world view.
5. When it's done, it exchanges the pointer to the current world view, publishing it as the current one.
6. Even if the renderer is still busy with the old world view, there is no locking necessary.
7. Eventually the renderer finishes rendering the (old) world. It grabs the new world view and starts another run.
8. In the meantime, ... (goto step 4)
The only locking you need is for the moment when you publish or grab the pointer to the world view. As an alternative, you can do an atomic exchange, but then you have to make sure you use smart pointers that support that.
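A minimal sketch of that publish/grab step, with a single lock guarding only the pointer exchange (WorldView and the class name are placeholders for your own types):

#include <memory>
#include <mutex>

struct WorldView { /* immutable snapshot of the game state */ };

class WorldPublisher
{
public:
    // Logic thread: publish a freshly built, read-only world view.
    void publish(std::shared_ptr<const WorldView> next)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(next);
    }

    // Render thread: grab the latest view; the lock is held only for the pointer copy.
    std::shared_ptr<const WorldView> grab() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const WorldView> current_;
};

The renderer's local shared_ptr keeps the old view alive for the duration of its frame, even after the logic thread has published a newer one.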
Most toolkits have an event loop (built above some multiplexing syscall like poll(2), or the obsolete select). For example, GTK has g_application_run, which sits above gtk_main, which is built above the GLib main event loop (which in fact does a poll or something similar). Likewise, Qt has QApplication and its exec method.
Very often, you can register timers within the event loop. For GTK, use GTimers, g_timeout_add etc. For Qt learn about its timers.
Very often, you can also register some idle or background processing, which is one of your function which is started by the event loop after other events and timeouts have been processed. Your idle function is expected to run quickly (usually it does a small step of some computation in a few milliseconds, to keep the GUI responsive). For GTK, use g_idle_add etc. IIRC, in Qt you can use a timer with a 0 delay.
So you could code even a (conceptually) single threaded application, using timeouts and idle processing.
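For example, with Qt the fixed-interval logic tick can be nothing more than a QTimer firing inside the event loop (the 16 ms interval and the lambda body are just placeholders):

#include <QApplication>
#include <QTimer>

int main(int argc, char **argv)
{
    QApplication app(argc, argv);

    QTimer logicTimer;
    QObject::connect(&logicTimer, &QTimer::timeout, [] {
        // advance the simulation by one fixed step here
    });
    logicTimer.start(16);   // roughly 60 logic ticks per second

    return app.exec();      // the event loop dispatches timeouts, idle work and GUI events
}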
Of course, you could use multi-threading: generally the main thread is running the event loop, and other threads can do other things. You have synchronization issues. On POSIX systems, a nice synchronization trick could be to use a pipe(7) to self: you set up a pipe before running the event loop, and your computation threads may write a few bytes on it, while the main event loop is "listening" on it (with GTK, using g_source_add_poll or async IO or GUnixInputStream etc.., with Qt, using QSocketNotifier etc....). Then, in the input handler running in the main loop for that pipe, you could access traditional global data with mutexes etc...
Conceptually, read about continuations. It is a relevant notion.
You could have a Draw and an Update method attached to all your game components. That way you can arrange that, while your game is running, Update is called and Draw is skipped, or any combination of the two. It also has the benefit of keeping logic and graphics completely separate.
Couldn't you just have a draw method for each object that needs to be drawn, and make those objects globals? Then just run your rendering thread with a sleep delay in it. As long as your rendering thread doesn't write any information to the globals, you should be fine. Look up SFML to see an example of this in action.
If you are running on a Unix system you could use usleep(); however, that is not available on Windows, so you might want to look here for alternatives.
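For what it's worth, std::this_thread::sleep_for from the standard library is one portable option; a minimal sketch of such a render thread (drawEverything() is a made-up placeholder):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};

void drawEverything();   // assumed: renders the current frame from the global draw data

void renderLoop()
{
    while (running.load())
    {
        drawEverything();                                            // read-only pass over the globals
        std::this_thread::sleep_for(std::chrono::milliseconds(16));  // crude pacing, ~60 fps
    }
}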
I'm working on a data import job using the Go language, I want to write each step as a closure, and use channels for communication, that is, each step is concurrent. The problem can be defined by the following structure.
Get Widgets from data source
Add translations from source 1 to Widgets.
Add translations from source 2 to Widgets.
Add pricing from source 1 to Widgets.
Add WidgetRevisions to Widgets.
Add translations from source 1 to WidgetRevisions
Add translations from source 2 to WidgetRevisions
For the purposes of this question, I'm only dealing with the first three steps, which must be taken on a new Widget. On that basis I assume that step four could be implemented as a pipeline step, itself implemented in terms of a sub-three-step pipeline to control the WidgetRevisions.
To that end I've been writing a small bit of code to give me the following API:
// A Pipeline is just a list of closures, and a smart
// function to set them all off, keeping channels of
// communication between them.
p, e, d := NewPipeline()
// Add the three steps of the process
p.Add(whizWidgets)
p.Add(popWidgets)
p.Add(bangWidgets)
// Start putting things on the channel, kick off
// the pipeline, and drain the output channel
// (probably to disk, or a database somewhere)
go emit(e)
p.Execute()
drain(d)
I've implemented it already (code at Gist or at the Go Playground), but it deadlocks 100% of the time.
The deadlock comes when calling p.Execute(), presumably because one of the channels ends up with nothing to do: nothing is being sent on it, and there is no work to do...
Adding a couple of lines of debug output to emit() and drain(), I see the following output. I believe the pipelining between the closure calls is correct, and I'm seeing some Widgets being emitted.
Emitting A Widget
Input Will Be Emitted On 0x420fdc80
Emitting A Widget
Emitting A Widget
Emitting A Widget
Output Will Drain From 0x420fdcd0
Pipeline reading from 0x420fdc80 writing to 0x420fdd20
Pipeline reading from 0x420fdd20 writing to 0x420fddc0
Pipeline reading from 0x420fddc0 writing to 0x42157000
Here's a few things I know about this approach:
I believe it's not uncommon for this design to "starve" one goroutine or another; I believe that's why this is deadlocking
I would prefer if the pipeline had things fed into it in the first place (the API would implement Pipeline.Process(*Widget))
If I could make that work, the drain could be a "step" which just didn't pass anything on to the next function; that might be a cleaner API
I know I haven't implemented any kind of ring buffer, so it's entirely possible that I'll just overload the available memory of the machine
I don't really believe this is good Go style... it seems to make use of a lot of Go features, but that isn't really a benefit in itself
Because the WidgetRevisions also need a pipeline, I'd like to make my Pipeline more generic; maybe an interface{} type is the solution, but I don't know Go well enough yet to tell whether that would be sensible
I've been advised to consider implementing a mutex to guard against race conditions, but I believe I'm safe, as each closure operates on one particular field of the Widget struct; however, I'd be happy to be educated on that topic
In summary: how can I fix this code, should I fix this code, and if you were a more experienced Go programmer than I am, how would you solve this "sequential units of work" problem?
I just don't think I would've built the abstractions that far away from the channels. Pipe explicitly.
You can pretty easily make a single function for all of the actual pipe manipulation, looking something like this:
type StageMangler func(*Widget)

func stage(f StageMangler, chi <-chan *Widget, cho chan<- *Widget) {
    for widget := range chi {
        f(widget)
        cho <- widget
    }
    close(cho)
}
Then you can pass in func(w *Widget) { w.Whiz = true} or similar to the stage builder.
Your Add at that point could take a collection of these and their worker counts, so a particular stage could have n workers much more easily.
I'm just not sure this is easier than piecing channels together directly unless you're building these pipelines at runtime.
I hope the title did not mislead you.
My problem is the following: currently I am trying to speed up a raytracer with the help of the graphics card. It works, despite the fact that it actually got slower. :)
This is caused by the fact that I trace one ray at a time over the whole geometry on the graphics card (my "tracing server") and then fetch the result, which is awfully slow. So I have to gather some rays, calculate them together, and fetch the results in one go to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and asks my "tracing server" to calculate the intersections. Then the thread is stopped until enough rays have been gathered to calculate the intersections on the graphics card and fetch the results back efficiently. This means that each thread will wait until the results have been fetched.
You see, I already have some sort of plan, but the following I do not know:
Which threading framework should I use to stay platform-independent?
Should I use a thread pool of fixed size, or create threads as needed?
Can any given thread library handle at least 1000 waiting threads (because that would be the number I need to gather for my fetch to be efficient)?
But I could also imagine doing this with one thread that dumps its load (a new ray) to the "tracing server" and fetches the next load until there is enough to fetch the results. Then the thread would take the results one by one, do the further calculations until all results are processed, and then go back to step one until all rays are done.
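Roughly, that single-threaded batching variant would look like the sketch below (traceBatch() stands in for the call to my "tracing server"; Ray and Hit stand in for whatever types the framework hands me):

#include <cstddef>
#include <vector>

struct Ray {};   // placeholder for the framework's ray type
struct Hit {};   // placeholder for an intersection result

// Assumed: sends a whole batch to the "tracing server" (GPU) and blocks until the results are back.
std::vector<Hit> traceBatch(const std::vector<Ray>& rays);

void traceAll(const std::vector<Ray>& allRays)
{
    const std::size_t batchSize = 1000;   // enough rays to make the round trip worthwhile
    std::vector<Ray> batch;
    batch.reserve(batchSize);

    for (std::size_t i = 0; i < allRays.size(); ++i)
    {
        batch.push_back(allRays[i]);
        if (batch.size() == batchSize || i + 1 == allRays.size())
        {
            const std::vector<Hit> hits = traceBatch(batch);   // one round trip per batch
            for (const Hit& hit : hits)
            {
                (void)hit;   // do the further per-ray calculations here
            }
            batch.clear();
        }
    }
}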
Also if you have some better idea how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As for thread pool vs. on-demand threads: a thread pool is generally the better idea, as it avoids thread-creation overhead.
The number of waiting threads is going to depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?
I need to implement a statistics reporter - an object that prints a bunch of statistics to the screen.
This info is updated by 20 threads.
The reporter must be a thread itself that wakes up every second, reads the info, and prints it to the screen.
My design so far: InfoReporterElement - one element of info. It has two functions, PrintInfo and UpdateData.
InfoReporterRow - one row on screen. A row holds a vector of InfoReporterElement objects.
InfoReporterModule - a module composed of a header and a vector of rows.
InfoReporter - the reporter, composed of a vector of modules and a header. The reporter exports the function PrintData, which goes over all modules/rows/basic elements and prints the data to screen.
I think that I should add an object responsible for receiving updates from the threads and updating the basic info elements.
The main problem is how to update the info - should I use one mutex for the whole object, or a mutex per basic element?
Also, which object should be a thread - the reporter itself, or the one that receives updates from the threads?
I would say that, first of all, the reporter itself should be a thread. It's basic, in terms of decoupling, to isolate the drawing part from the active code (MVC).
The structure itself is of little use here. When you reason in terms of multithreading, it's not so much the structure as the flow of information that you should check.
Here you have 20 active threads that will update the information, and 1 passive thread that will display it.
The problem here is that you run the risk of introducing some delay in the work to be done because an active thread cannot acquire the lock (used for display). Reporting (or logging) should never block, or block as little as possible.
I propose to introduce an intermediate structure (and thread), to separate the GUI and the work: a queuing thread.
the active threads post events to the queue
the queuing thread updates the structure described above
the displaying thread shows the current state
You can avoid some synchronization issues by using the same idea that is used for Graphics. Use 2 buffers: the current one (that is displayed by the displaying thread) and the next one (updated by the queuing thread). When the queuing thread has processed a batch of events (up to you to decide what a batch is), it asks to swap the 2 buffers, so that next time the displaying thread will display fresh info.
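A bare-bones sketch of that two-buffer idea (StatsSnapshot and all the names here are placeholders, not your actual classes):

#include <mutex>
#include <utility>

struct StatsSnapshot { /* the rows/elements currently shown */ };

class DoubleBufferedStats
{
public:
    // Queuing thread only: the buffer it fills while processing a batch of events.
    StatsSnapshot &backBuffer() { return next_; }

    // Queuing thread: after a batch has been applied to the back buffer, publish it.
    void swapBuffers()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        std::swap(current_, next_);
    }

    // Displaying thread: copy out the current snapshot, then render it without holding the lock.
    StatsSnapshot currentCopy() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    StatsSnapshot current_;
    StatsSnapshot next_;
};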
Note: on a more personal note, I don't like your structure. The working thread has to know exactly where on the screen the element it should update is displayed; this is a clear breach of encapsulation.
Once again, look up MVC.
And since I am neck deep in patterns: look up Observer too ;)
The main problem is how to update the info - should I use one mutex for the whole object, or a mutex per basic element?
Put a mutex around the basic unit of update action. If this is an InfoReporterElement object, you'd need a mutex per such object. Otherwise, if a whole row is updated at a time by any one of the threads, then put the mutex around the row, and so on.
Also, which object should be a thread - the reporter itself, or the one that receives updates from the threads?
You can put all of them in separate threads -- multiple writer threads that update the information and one reader thread that reads the value.
You seem to have a pretty good grasp of the basics of concurrency.
My initial thought would be a queue with a mutex that locks for writes and deletes. If you have the time, then I would look at lock-free access.
For your second concern, I would have just one reader thread.
A piece of code would be nice to operate on.
Attach a mutex to every InfoReporterElement. As you've written in a comment, not only do you need to get and set an element's value, but also to increment it and probably do other things, so what I'd do is make a mutex-protected member function for every interlocked operation I need.
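For illustration, such a per-element lock could look like the sketch below (only the InfoReporterElement name comes from your design; the member and the extra operations are invented for the example):

#include <mutex>

class InfoReporterElement
{
public:
    // Called by the worker threads: every interlocked operation takes the element's own mutex.
    void UpdateData(long value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }

    void Increment(long delta = 1)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ += delta;
    }

    // Called once a second by the reporter thread before printing.
    long Read() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    long value_ = 0;
};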