multiple processing in WinCE6.0 or DMA implementation - c++

in my application i want to do the task parallely like one thread will do the calculation, and other will draw the data on screen, but while drawing the data processor is gettign engaged and during that time it is not able to process the data of diffrent thread. i runnig both thread on above normal priorty. Is there any way in whch i can do the drawing parallely, so that measurment thread can do the calculation at that speed wthout getting affected by drawing thread. i heared from some one DMA can solve the problem, but how to imlement it in WINCE6.0 platform i have no idea.
Pls provide any pointer
Mukesh

No idea how DMA would "solve" this issue - you're using a single processor core, which can only execute one set of instructions at a time. DMA won't change that.
The problem you're having sounds like you're using the processor at just about full capacity and so you're not seeing much time sharing between your threads. There are generally 2 ways to approach this.
1) adjust the priority of your more important thread to have it get more time from the scheduler to do its work.
or
2) adjust the thread quantum for your threads to force the scheduler to swap between threads more frequently.

Related

Multithreading in Direct 3D 12

Hi I am a newbie learning Direct 3D 12.
So far, I understood that Direct 3D 12 is designed for multithreading and I'm trying to make my own simple multithread demo by following the tutorial by braynzarsoft.
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
Environment is windows, using C++, Visual Studio.
As far as I understand, multithreading in Direct 3D 12 seems, in a nutshell, populating command lists in multiple threads.
If it is right, it seems
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
is enough for a single window program.
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does GetCPUDescriptorHandleForHeapStart() return value changes?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
Purpose of Q4 is I thought of calling the function once and store the value for reuse, it didn't change when I debugged.
Rendering loop in my mind is (based on Game Loop pattern), for example,
Thread waits for fence value (eg. Main thread).
Begin multiple threads to populate command lists.
Wait all threads done with population.
ExecuteCommandLists.
Swap chain present.
Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
Q2. Why do we need multiple fences?
All gpu work is asynchronous. So you can think of fences as low level tools to achieve the same result as futures/coroutines. You can check if the work has been completed, wait for work to complete or set an event on completion. You need a fence whenever you need to guarantee a resource holds the output of work (when resource barriers are insufficient).
Q4. Does GetCPUDescriptorHandleForHeapStart() return value changes?
No it doesn't.
store the value for reuse, it didn't change when I debugged.
The direct3d12 samples do this, you should know them intimately if you want to become proficient.
Rendering loop in my mind is (based on Game Loop pattern), for example,
That sounds okay, but I urge you to look at the direct3d12 samples and steal the patterns (and the code) they use there.

Adapting program from single to multicore

I am considering a programming project. Will run under Ubuntu or other Linux OS on a small board. Quad core x86 - N-Series Pentium. The software generates 8 fast signals; square wave pulse trains for stepper motor motion control of 4 axes. Step signals being 50-100 KHz maximum, but usually slower. Want to avoid jitter in these stepping signals (call it good fidelity), so that around 1-2us for each thread loop cycle would be a nice target. The program does other kinds of tasks, like read/write hard drive, Ethernet, continues update on the graphics display, keyboard. The existing Single core programs just can not process motion signals with this kind of timing and require external hardware/techniques to achieve this.
I have been reading other posts here, like on a thread running selected core, continuously. The exact meaning in these posts is "lose", not sure really what is meant. Continuous might mean testing every minute or ?????
So, I might be wordy, but it will clear I hope. The considered program has all the threads, routines, memory, shared memory all included. I am not considering that this program launches another program or service. Other threads are written in this program and launched when the program starts up. I will call this signal generating thread the FAST THREAD.
The FAST THREAD is to be launched to an otherwise "free" core. It needs to be the only thread that runs on the core. Hopefully, the OS thread scheduler component on that core can be "turned off", so that it does not even interrupt on that core to decide what thread runs next. In looking at the processor manual, Each core has a counter timer chip. Is it possible then that I can use it to provide a continuous train of interrupts then into my "locked in" FAST THREAD for timing purposes? This is the range of about 1-2 us. If not, then just reading one channel on that CTC to provide software sync. This fast thread will, therefore, see (experience) no delays from the interrupts issued in the other cores and associated multicore fabric. This FAST THREAD, when running, will continue to run until the program closes. This could be hours.
Input data to drive this FAST THREAD will be common shared memory defined in the program. There are also hardware signals for motion limits (From GPIOs or SDI port). If any go TRUE, that forces a programmed halt all motion. It does not need a 1~2us response. It could go to a slower Motion loop.
Ah, the output:
Some motion data is written back to the shared memory (assigned for this purpose). Like current location, and current loop number,
Some signals need to be output (the 8 outputs). There are numerous free GPIOs. Not sure of the path taken to get the signaled GPIO pin to change the output. The system call to Linux initiates the pin change event. There is also an SDI port available, running up to the 25Mhz clock. It seems these ports (GPIO, UART, USB, SDI) exist in the fabric that is not on any specific core. I am not sure of the latency from the issuance of these signals in the program until the associated external pin actually presents that signal. In the fast thread, even 10us would be OK, if it was always the same latency! I know that will not be so, there will jitter. I need to think on this spec.
There will possibly be a second dedicated core (similar to above) for slower motion planning. That leaves two cores for everything thing else. Since then everything else items (sata, video screen, keyboard ...) are already able to work in a single core, then the remaining two cores should be great.
At close of program, the FAST THREAD returns the CTC and any other device on its core back to "as it was", re-enables the OS components in this core to their more normal operation. End of thread.
Concluding: I have described the overall program, so as for you to understand what I want to do with this FAST THREAD running, how responsive it needs to be, and that it needs to be undisturbed!! This processor runs in the 1.5 ~ 2.0 GHz range. It certainly can do the repeated calculations in the required time frame.
DESIRED: I do not know the system calls that would allow me to use a selected x86 core in this way. Any pointers would be helpful. Any manual or document that described these calls/procedures.
Can this use of a core also be done in windows 7,10)?
Thanks for reading and any pointers you have.
Stan

C++ Qt fast timing of asynchronous processes advice

i'm currently dealing with a Qt GUI i have to set up for a measurement device. The device is working with a frame grabber card which grabs images from a line camera really fast. My image processing which is not that complex takes 0.2ms to complete, and it takes about 40ms to display the signal and the processing result with QCustomPlot which is totally okay.
Besides the GUI output the processed signal will also be put out as an analog signal by a NI DAQ device.
My problem is that i have to update the analog signal with a constant frequency and still update the GUI from time to time.
My current approach or idea was to create a data pool thread and two worker threads. One worker thread receives the data from the frame grabber, processes it an updates the data pool. The second worker thread updates the analog channel of the NI DAQ with a certain frequency of about 2-5kHz given by a clock in the NI DAQ device.
And the GUI thread would read the data pool from to time to time to update the signal display with a rate of about 20-30Hz.
I wanted to use the Qt thread management and he signal-and-slot mechanism because of its "simplicity" and because i already worked with threads in combination with Qt and its thread classes.
Is there maybe a better way, does somebody have an idead or any suggestion? Is it possible that i get problems in the timing of the threads?
Furhtermore is it possible to assign one thread to one single CPU core on a multi core CPU, so that this core only processes this single thread?
Is there maybe a better way, does somebody have an idead or any suggestion? Is it possible that i get problems in the timing of the threads?
Signal/Slot mechanism is fine, try it and if you get into performance issues you can still try to find another approach. I used Signal/Slot Mechanism for real-time video processing with QAbstractVideoSurface and Mediaplayer. It worked for me.
Furhtermore is it possible to assign one thread to one single CPU core on a multi core CPU, so that this core only processes this single thread?
Why would you do that? The operating system or threading library has a scheduler, which takes care of such things. As long you got no good reason doing this yourself, you should just use the existing way.
I would try it with three threads: 1)UI thread, 2)grab-and-process thread, 3)analogue output thread.
The trick is to use a triple buffer to connect output of grab-and-process to input of analogue output.
Say, at moment t, thread(2) finishes processing frame[(t+0)%3], change output destination to frame[(t+1)%3] immediately, and notifies thread(3), which is looping through data in frame[(t+2)%3], to switch to frame[(t+0)%3] when appropriate.
I used this technique when I was working on an image processing project that has a 10fps processing frame rate and a 60fps NTSC output frame rate. To eliminate the tearing effect, a circular buffer with three buffers is the least.

Idendify the reason for a 200 ms freezing in a time critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, it happens once in a while, that the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU #3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is, that the performance problem is linked to the data being exchanged between cores. Only if the threads run on the same core by chance, the exchange would be much faster. I thought about setting the thread affinity mask in a way to force both threads to run on the same core, but this also means, that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.
Locking and releasing a mutex just to switch one bool variable will not take 200ms.
Main problem is probably that two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead parallelism you have two thread waiting for each other to finish their part of work, having similar effect as in single threaded approach.
For further reading I recommend this article for a read, which describes lock contention with more detailed level.
Since you are running on windows maybe you use visual studio? if yes I would resort to VS profiler which is quite good (IMHO) in such cases, once you don't need to check data/instruction caches (then the Intel's vTune is a natural choice). From my experience VS is good enough to catch contention problems as well as CPU bottlenecks. you can run it directly from VS or as standalone tool. you don't need the VS installed on your test machine you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
then you can open the file and inspect it in visual studio. if you don't see anything making your CPU busy switch to contention profiling - just change the "start" argument from "SAMPLE" to "CONCURRENCY"
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\, AFAIR it is available from VS2010
Good luck
After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one that with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command via the virtual serial port pair gets sometimes lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received and the first as well as the second attempt to receive the data timed out with their 100 ms timeout time.

Platform independent parallelization without changing the framework?

I hope the title did not mislead you.
My problem is the following: Currently I try to speed up a raytracer and this is done with the help of the graphics card. It works fine despite the fact that it got slower by this. :)
This is caused by the fact, that I trace one ray on the whole geometry at once on the graphics card(my "tracing server") and then fetch the results, which is awfully slow, so I have to gather some rays and calc them and fetch the results together to speed this up.
The next problem is, that I am not allowed to rewrite the surrounding framework that should know nothing or least possible about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and requests my "tracing server" to calc the intersections. Then the thread is stopped until enough rays were gathered to calc the intersections on the graphics card and get the results back efficiently. This means that each thread will wait until the results were fetched.
You see I already have some plan but following I do not know:
Which threading framework should I take to be platformindependent?
Should I use a threadpool of fixed size or create them as needed?
Can any given thread library handle at least 1000 waiting threads(because that would be the number that I need to gather for my fetch to be efficient)?
But I also could imagine doing this with one thread that
dumps its load (a new ray) to the "tracing server" and fetches the next load until
there is enough to fetch the results.
Then the thread would take the results one by one, do the further calculations until all results are processed and then goes back to step one until all rays are done.
Also if you have some better idea how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
use either Thread Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As far as threadpool/on-demand-threads - threadpool is generally better idea as it avoids creation overhead.
Number of waiting threads is gonna depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?