How do I take a snapshot of all the values in a sliding-buffer channel?
I'm using the sliding buffer to store temporal measurements, and I occasionally want to render the most recent (say 100) values.
There are some circular buffer implementations making the rounds, but I'd like to leverage the core.async abstractions later.
I've tried to tap into a mult, calling (seq my-sliding-buffer) and calling take! on the tap, but Clojure complains that the SlidingBuffer implementation does not support take!.
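To make the question concrete, the behaviour I'm after is something like this sketch: give the renderer its own tap with its own sliding buffer, and drain that tap on demand (poll! never blocks; all names here are illustrative):

(require '[clojure.core.async :as async])

(def raw (async/chan))        ; measurements are put here
(def m (async/mult raw))

;; a tap dedicated to snapshots; its sliding buffer keeps the latest 100 values
(def snapshot-ch (async/tap m (async/chan (async/sliding-buffer 100))))

(defn snapshot!
  "Drain whatever is currently buffered on ch into a vector."
  [ch]
  (loop [acc []]
    (if-some [v (async/poll! ch)]
      (recur (conj acc v))
      acc)))

;; (snapshot! snapshot-ch) => vector of up to 100 recent measurements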
Hi, I am a newbie learning Direct3D 12.
So far, I have understood that Direct3D 12 is designed for multithreading, and I'm trying to make my own simple multithreaded demo by following the tutorial by braynzarsoft.
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
Environment is windows, using C++, Visual Studio.
As far as I understand, multithreading in Direct3D 12 means, in a nutshell, populating command lists on multiple threads.
If that is right, it seems that
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
are enough for a single-window program.
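In code, I imagine the per-thread setup roughly like this (ComPtr is Microsoft::WRL::ComPtr; error handling omitted; this is just my sketch of the layout above):

std::vector<ComPtr<ID3D12CommandAllocator>> allocators(numThreads);
std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(numThreads);
for (UINT i = 0; i < numThreads; ++i) {
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                   IID_PPV_ARGS(&allocators[i]));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                              allocators[i].Get(), nullptr,
                              IID_PPV_ARGS(&lists[i]));
}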
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does the GetCPUDescriptorHandleForHeapStart() return value ever change?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
The reason for Q4 is that I thought of calling the function once and storing the value for reuse; it didn't change when I debugged.
The rendering loop in my mind (based on the Game Loop pattern) is, for example:
1. A thread (e.g. the main thread) waits for the fence value.
2. Start multiple threads to populate the command lists.
3. Wait until all threads are done populating.
4. ExecuteCommandLists.
5. Swap chain Present.
6. Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
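Creating an extra queue is straightforward; for example, a compute-only queue next to the usual direct queue might look like this (a sketch; error handling omitted):

D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute-only queue
Microsoft::WRL::ComPtr<ID3D12CommandQueue> computeQueue;
device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));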
Q2. Why do we need multiple fences?
All GPU work is asynchronous, so you can think of fences as low-level tools to achieve the same result as futures/coroutines: you can check whether work has completed, wait for it to complete, or set an event that fires on completion. You need a fence whenever you need to guarantee that a resource holds the output of finished work (i.e. when resource barriers are insufficient).
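A typical CPU-side wait looks something like this (fence, fenceEvent and fenceValue are assumed members, in the style of the official samples):

const UINT64 waitValue = ++fenceValue;
commandQueue->Signal(fence.Get(), waitValue);   // GPU signals when it gets here
if (fence->GetCompletedValue() < waitValue)     // has the GPU reached it yet?
{
    fence->SetEventOnCompletion(waitValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);  // block the CPU until it has
}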
Q4. Does the GetCPUDescriptorHandleForHeapStart() return value ever change?
No, it doesn't.
storing the value for reuse; it didn't change when I debugged.
The Direct3D 12 samples do this; you should know them intimately if you want to become proficient.
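For example, caching the heap start once and offsetting into it, as the samples do (rtvHeap and frameIndex are assumed):

const D3D12_CPU_DESCRIPTOR_HANDLE rtvStart =
    rtvHeap->GetCPUDescriptorHandleForHeapStart();
const UINT rtvSize =
    device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_RTV);
D3D12_CPU_DESCRIPTOR_HANDLE rtv = { rtvStart.ptr + frameIndex * rtvSize };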
The rendering loop in my mind (based on the Game Loop pattern) is, for example:
That sounds okay, but I urge you to look at the Direct3D 12 samples and steal the patterns (and the code) they use there.
I have a Clojure processing app that is a pipeline of channels. Each processing step does its computation asynchronously (i.e. makes an HTTP request using http-kit or something) and puts its result on the output channel. This way the next step can read from that channel and do its computation.
My main function looks like this
(defn -main [args]
(-> file/tmp-dir
(schedule/scheduler)
(search/searcher)
(process/resultprocessor)
(buy/buyer)
(report/reporter)))
Currently, the scheduler step drives the pipeline (it hasn't got an input channel), and provides the chain with workload.
When I run this in the REPL:
(-main "some args")
It basically runs forever because the scheduler never terminates. What is the best way to change this architecture so that I can shut down the whole system from the REPL? Does closing each channel mean the system terminates?
Would some broadcast channel help?
You could have your scheduler alts! / alts!! on a kill channel and the input channel of your pipeline:
(def kill-channel (async/chan))

(defn scheduler [input out-ch kill-ch]
  (loop []
    (let [[v p] (async/alts!! [kill-ch [out-ch (preprocess input)]]
                              :priority true)]
      (if-not (= p kill-ch)
        (recur)))))
Putting a value on kill-channel will then terminate the loop.
Technically you could also use output-ch to control the process (puts to closed channels return false), but I normally find explicit kill channels cleaner, at least for top-level pipelines.
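For example, from the REPL (using the names above):

(async/>!! kill-channel :stop)    ; or (async/close! kill-channel)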
To make things simultaneously more elegant and more convenient to use (both at the REPL and in production), you could use Stuart Sierra's component: start the scheduler loop on a separate thread and assoc the kill channel onto your component in the component's start method, then close! the kill channel (and thereby terminate the loop) in the component's stop method.
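A minimal sketch of that component, assuming the scheduler function from above (the record fields and names are illustrative):

(require '[com.stuartsierra.component :as component]
         '[clojure.core.async :as async])

(defrecord Scheduler [input output kill-ch]
  component/Lifecycle
  (start [this]
    (let [kill-ch (async/chan)]
      ;; run the scheduler loop on its own thread
      (async/thread (scheduler input output kill-ch))
      (assoc this :kill-ch kill-ch)))
  (stop [this]
    (when kill-ch
      (async/close! kill-ch))   ; terminates the loop
    (assoc this :kill-ch nil)))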
I would suggest using something like https://github.com/stuartsierra/component to handle system setup. It lets you easily start and stop your system in the REPL. Using that library, you would set it up so that each processing step is a component, and each component handles setup and teardown of its channels in its start and stop methods. You could also create an IStream protocol for each component to implement and have each component depend on components implementing that protocol; it buys you some very easy modularity.
You'd end up with a system that looks like the following:
(component/system-map
:scheduler (schedule/new-scheduler file/tmp-dir)
:searcher (component/using (search/searcher)
{:in :scheduler})
:processor (component/using (process/resultprocessor)
{:in :searcher})
:buyer (component/using (buy/buyer)
{:in :processor})
:report (component/using (report/reporter)
{:in :buyer}))
One nice thing with this sort of approach is that you could easily add components if they rely on a channel as well. For example, if each component creates its out channel using a tap on an internal mult, you could add a logger for the processor just by a logging component that takes the processor as a dependency.
:processor (component/using (process/resultprocessor)
            {:in :searcher})
:processor-logger (component/using (log/logger)
                   {:in :processor})
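A rough sketch of that mult/tap arrangement inside a component's start (all names hypothetical):

(let [raw (async/chan)               ; the processor writes results here
      m   (async/mult raw)]
  (assoc this
         :raw raw
         :out (async/tap m (async/chan))     ; next pipeline stage reads this
         :log (async/tap m (async/chan))))   ; the logger reads this one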
I'd recommend watching his talk as well to get an idea of how it works.
You should consider using Stuart Sierra's reloaded workflow, which depends on modelling your 'pipeline' elements as components. That way you can model your logical singletons as 'classes', meaning you can control the construction and destruction (start/stop) logic for each one of them.
I'm working on a data import job in Go. I want to write each step as a closure and use channels for communication; that is, each step is concurrent. The problem can be defined by the following structure:
Get Widgets from data source
Add translations from source 1 to Widgets.
Add translations from source 2 to Widgets.
Add pricing from source 1 to Widgets.
Add WidgetRevisions to Widgets.
Add translations from source 1 to WidgetRevisions
Add translations from source 2 to WidgetRevisions
For the purposes of this question, I'm only dealing with the first three steps, which must be taken on a new Widget. I assume on that basis that step four could be implemented as a pipeline step, which is itself implemented in terms of a three-step sub-pipeline to control the WidgetRevisions.
To that end I've been writing a small bit of code to give me the following API:
// A Pipeline is just a list of closures, and a smart
// function to set them all off, keeping channels of
// communication between them.
p, e, d := NewPipeline()
// Add the three steps of the process
p.Add(whizWidgets)
p.Add(popWidgets)
p.Add(bangWidgets)
// Start putting things on the channel, kick off
// the pipeline, and drain the output channel
// (probably to disk, or a database somewhere)
go emit(e)
p.Execute()
drain(d)
I've implemented it already (code at Gist or at the Go Playground), but it deadlocks every single time.
The deadlock comes when calling p.Execute(); presumably one of the channels ends up starved, with nothing being sent on it and no work to do.
Adding a couple of lines of debug output to emit() and drain(), I see the following output. I believe the pipelining between the closure calls is correct, and I'm seeing some Widgets being emitted.
Emitting A Widget
Input Will Be Emitted On 0x420fdc80
Emitting A Widget
Emitting A Widget
Emitting A Widget
Output Will Drain From 0x420fdcd0
Pipeline reading from 0x420fdc80 writing to 0x420fdd20
Pipeline reading from 0x420fdd20 writing to 0x420fddc0
Pipeline reading from 0x420fddc0 writing to 0x42157000
Here are a few things I know about this approach:
I believe it's not uncommon for this design to "starve" one goroutine or another; I believe that's why this is deadlocking.
I would prefer it if the pipeline had things fed into it in the first place (the API would implement Pipeline.Process(*Widget)).
If I could make that work, the drain could be a "step" which just didn't pass anything on to the next function; that might be a cleaner API.
I know I haven't implemented any kind of ring buffer, so it's entirely possible that I'll just overload the available memory of the machine.
I don't really believe this is good Go style... it makes use of a lot of Go features, but that isn't really a benefit in itself.
Because the WidgetRevisions also need a pipeline, I'd like to make my Pipeline more generic; maybe an interface{} type is the solution, but I don't know Go well enough to determine whether that'd be sensible yet.
I've been advised to consider implementing a mutex to guard against race conditions, but I believe I'm safe, as the closures each operate on one particular field of the Widget struct; however, I'd be happy to be educated on that topic.
In summary: how can I fix this code, should I fix this code, and if you were a more experienced Go programmer than I am, how would you solve this "sequential units of work" problem?
I just don't think I would've built the abstractions that far away from the channels. Pipe explicitly.
You can pretty easily make a single function for all of the actual pipe manipulation, looking something like this:
type StageMangler func(*Widget)
func stage(f StageMangler, chi <-chan *Widget, cho chan<- *Widget) {
for widget := range chi {
f(widget)
cho <- widget
}
close(cho)
}
Then you can pass in func(w *Widget) { w.Whiz = true} or similar to the stage builder.
Your Add at that point could take a collection of these plus a worker count, so a particular stage could have n workers much more easily.
I'm just not sure this is easier than piecing channels together directly unless you're building these pipelines at runtime.
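That said, a complete wiring of three such stages is short enough to sketch here (the Widget fields are illustrative, taken from the whiz/pop/bang names in the question):

package main

import "fmt"

type Widget struct{ Whiz, Pop, Bang bool }

type StageMangler func(*Widget)

// stage is the function from above: apply f to each widget and pass it on.
func stage(f StageMangler, chi <-chan *Widget, cho chan<- *Widget) {
	for widget := range chi {
		f(widget)
		cho <- widget
	}
	close(cho)
}

func main() {
	in, ab, bc, out := make(chan *Widget), make(chan *Widget), make(chan *Widget), make(chan *Widget)

	// one goroutine per stage; each closes its output when its input drains
	go stage(func(w *Widget) { w.Whiz = true }, in, ab)
	go stage(func(w *Widget) { w.Pop = true }, ab, bc)
	go stage(func(w *Widget) { w.Bang = true }, bc, out)

	go func() { // emit
		for i := 0; i < 4; i++ {
			in <- &Widget{}
		}
		close(in)
	}()

	for w := range out { // drain
		fmt.Printf("%+v\n", *w)
	}
}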
I'm reading values from a sensor via serial (accelerometer values), in a loop similar to this:
while (1) {
    vector values = getAccelerometerValues();
    // Calculate velocity
    // Calculate total displacement
    if (displacement == 0)
        print("Back at origin");
}
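where the two commented steps are, roughly, naive Euler integration (assuming a fixed sample period and acceleration in m/s²):

const double dt = 1.0 / 120.0;                // sample period at ~120 Hz
velocity     = velocity + values * dt;        // v += a * dt
displacement = displacement + velocity * dt;  // x += v * dt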
I know the time that it takes per sample, which is taken care of in getAccelerometerValues(), so I have a time period with which to calculate velocity, displacement, etc. I sample at approximately 120 samples/second.
This works, but there are bugs (non-precise accelerometer values, floating-point errors, etc), and calibrating and compensating to get reasonably accurate displacement values is proving difficult.
I'm having great amounts of trouble finding a process to debug the loop. If I use a debugger (my code happens to be written in C++, and I'm slowly learning to use gdb rather than print statements), I have issues stepping through and pushing my sensor around to get an accelerometer reading at the point in time that the debugger executes the line. That is, it's very difficult to get the timing of "continue to next line" and "pushing sensor so it's accelerating" right.
I can use lots of print statements, which tend to fly past on-screen, but due to the number of samples, this gets tedious and difficult to derive where the problems are, particularly if there's more than one print statement per loop tick.
I can decrease the number of samples, which improves the readability of the programs output, but drastically decreases the reliability of the acceleration values I poll from the sensor; if I poll at 1Hz, the chances of polling the accelerometer value while it's undergoing acceleration drop considerably.
Overall, I'm having issues stepping through the code and using realistic data; I can step through it with non-useful data, or I can not really step through it with better data.
I'm assuming print statements aren't the best debugging method for this scenario. What's a better option? Are there any kinds of resources that I would find useful (am I missing something with gdb, or are there other tools that I could use)? I'm struggling to develop a methodology to debug this.
A sensible approach would be to use some interface for getAccelerometerValues() that you can substitute at either runtime or build time, such as passing in a base-class pointer with a virtual method to override in a concrete derived class.
I can describe the mechanism in more detail if you need, but the ideal is to be able to run the same loop against:
real live data
real live data (and save it to file)
canned real data saved from a previous run
fake data you cooked up as a test case
Note in particular that the 'replay' version(s) should be easy to debug, if each call just returns the next data from the file.
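A minimal sketch of that interface with a file-replay implementation (all names are illustrative):

#include <fstream>

struct Vector3 { double x, y, z; };

class AccelerometerSource {
public:
    virtual ~AccelerometerSource() = default;
    virtual Vector3 getAccelerometerValues() = 0;
};

// Replays samples previously saved from a live run, one per call.
class ReplaySource : public AccelerometerSource {
public:
    explicit ReplaySource(const char* path) : in_(path) {}
    Vector3 getAccelerometerValues() override {
        Vector3 v{};
        in_ >> v.x >> v.y >> v.z;   // at EOF this leaves v zeroed
        return v;
    }
private:
    std::ifstream in_;
};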
Create if blocks for the exact conditions you want to debug. For example, if you only care about when the accelerometer reads that you are moving left:
if (movingLeft(values)) {
    print("left");
}
The usual solution to this problem is recording. You capture sample sequences from your sensor in real time and store them to files. Then you train your system, debug your code, etc., using the recorded data. Finally, you connect the working code to the real data stream flowing live from the sensor.
I would debug your code with fake (i.e. random) values before everything else.
If the computations work as expected, then I would use the values read from the port.
Also, isn't there a way to read those values in a callback/push fashion, that is, to get your function called only when there's new, reliable data?
Edit: I don't know what libraries you are using, but in the .NET Framework you can use the SerialPort class with its DataReceived event. That way you are sure to use the most current and reliable data.
I am trying to make some design decisions for an algorithm I am working on. I think that I want to use signals and slots to implement an observer pattern, but I am not sure of a few things.
Here is the algorithm I am working towards:
1.) Load tiles of an image from a large file
1a.) Copy the entire file to a new location
2.) Process the tiles as they are loaded
3.) If the copy has been created, copy the resulting data into the new file
So I envision having a class with functions like loadAllTiles(), which would emit signals to tell processTile() that another tile is ready to be processed, while moving on to load the next tile.
processTile() would perform some calculations, and when complete, signal to writeResults() that a new set of results data was ready to be written. writeResults() would verify that the copying was complete, and start writing the output data.
Does this sound reasonable? Is there a way to make loadAllTiles() load in a tile, pass that data somehow to processTile(), and then keep going and load the next tile? I was thinking about setting up a list of some sort to store the tiles ready to be processed, and another list for result tiles ready to be written to disk. I guess the disadvantage there is that I have to somehow keep those lists intact, so that multiple threads aren't trying to add/remove items from a list at the same time.
Thanks for any insight.
It's not completely clear in your question, but it seems that you want to split the work across several threads, so that the processing of tiles can begin before you finish loading the entire set.
Consider a multithreaded processing pipeline architecture. Assign one thread per task (loading, copying, processing), and pass around tiles between tasks via Producer-Consumer queues (aka BlockingQueue). To be more precise, pass around pointers (or shared pointers) to tiles to avoid needless copying.
There doesn't seem to be a ready-made thread-safe BlockingQueue class in Qt, but you can roll your own using QQueue, QWaitCondition, and QMutex (see the sketch after this list). Here are some sources of inspiration:
Just Software Solutions' blog article.
Java's BlockingQueue
ZThreads's BlockingQueue
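A minimal sketch of such a queue built from those three classes (unbounded; capacity handling omitted):

#include <QMutex>
#include <QQueue>
#include <QWaitCondition>

template <typename T>
class BlockingQueue {
public:
    void put(const T& item) {
        QMutexLocker locker(&mutex_);
        queue_.enqueue(item);
        notEmpty_.wakeOne();
    }

    T take() {
        QMutexLocker locker(&mutex_);
        while (queue_.isEmpty())
            notEmpty_.wait(&mutex_);   // releases the mutex while waiting
        return queue_.dequeue();
    }

private:
    QQueue<T> queue_;
    QMutex mutex_;
    QWaitCondition notEmpty_;
};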
While there isn't a ready-made BlockingQueue within Qt, it seems that using signals & slots with the Qt::QueuedConnection option may serve the same purpose. This Qt blog article makes such use of signals and slots.
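For instance, a queued connection between two worker objects might look like this (TileLoader, TileProcessor and the Tile type are hypothetical; custom argument types must be registered with qRegisterMetaType() for queued connections):

QObject::connect(loader,    SIGNAL(tileReady(Tile*)),
                 processor, SLOT(processTile(Tile*)),
                 Qt::QueuedConnection);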
You may want to combine this pipeline approach with a memory pool or free list of tiles, so that already allocated tiles are recycled in your pipeline.
Here's a conceptual sketch of the pipeline:
TilePool -> TileLoader -> PCQ -> TileProcessor -> PCQ -> TileSaver -\
   ^                                                                |
   \----------------------------------------------------------------/
where PCQ represents a Producer-Consumer queue.
To exploit even more parallelism, you can try thread pools at each stage.
You can also consider checking out Intel's Threading Building Blocks. I haven't tried it myself. Be aware of the GPL licence for the open source version.
Keeping the lists from corruption should be possible with any of the usual locking mechanisms: simple locks, semaphores, etc.
Otherwise, the approach sounds reasonable, though I would say the files would have to be large for this to make sense. As long as they easily fit into memory, I don't see the point of loading them piecewise. Also: how do you plan on extracting tiles without reading the entire image repeatedly?