How can streams offer concurrent execution in CUDA?

In the CUDA documentation, it is mentioned that if we use two streams (stream0 and stream1) in the following way: we copy data to the device in stream0, launch the first kernel in stream0, then copy the results back from the device in stream0, and afterwards perform the same operations in stream1, then, as mentioned in the book "CUDA by Example" (2010), this does not offer concurrent execution. Yet in the "concurrent kernels" sample this method is used and does offer concurrent execution. So can you please help me understand the difference between the two examples?

Overlapped data transfer depends on many factors, including the compute capability version and the coding style. This blog post may provide more info:
https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc
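Whether transfers can overlap at all can be checked programmatically. A minimal sketch (device 0 assumed) that queries the properties the blog post discusses, deviceOverlap and asyncEngineCount (the number of copy engines):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceOverlap: can copies overlap kernel execution at all?
    // asyncEngineCount: 1 = one copy direction at a time, 2 = H2D and D2H concurrently.
    printf("deviceOverlap: %d, asyncEngineCount: %d\n",
           prop.deviceOverlap, prop.asyncEngineCount);
    return 0;
}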

I'm just expanding Eric's answer.
In the CUDA C Programming Guide, an example is given of using 2 streams, say stream0 and stream1, to do the following:
CASE A
memcpyHostToDevice --- stream0
kernel execution --- stream0
memcpyDeviceToHost --- stream0
memcpyHostToDevice --- stream1
kernel execution --- stream1
memcpyDeviceToHost --- stream1
In other words, all the operations of stream0 are issued first, and then those of stream1. The same example is reported in the "CUDA By Example" book, Section 10.5, but there it is "apparently" concluded (in "apparent" contradiction with the guide) that concurrency is not achieved this way.
In Section 10.6 of "CUDA By Example", the following alternative use of streams is proposed:
CASE B
memcpyHostToDevice --- stream0
memcpyHostToDevice --- stream1
kernel execution --- stream0
kernel execution --- stream1
memcpyDeviceToHost --- stream0
memcpyDeviceToHost --- stream1
In other words, the memory copy operations and kernel executions of stream0 and stream1 are now interlaced. The book points out that with this solution concurrency is achieved.
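As a minimal sketch of the two issue orders (the kernel, launch configuration, array names and sizes are made up for illustration; h_a0/h_a1 are assumed to be pinned host memory):

// CASE A: all of stream0's operations are issued first, then stream1's.
cudaMemcpyAsync(d_a0, h_a0, N * sizeof(float), cudaMemcpyHostToDevice, stream0);
kernel<<<grid, block, 0, stream0>>>(d_a0, N);
cudaMemcpyAsync(h_a0, d_a0, N * sizeof(float), cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(d_a1, h_a1, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(d_a1, N);
cudaMemcpyAsync(h_a1, d_a1, N * sizeof(float), cudaMemcpyDeviceToHost, stream1);

// CASE B: the same operations, interlaced across the two streams.
cudaMemcpyAsync(d_a0, h_a0, N * sizeof(float), cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_a1, h_a1, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream0>>>(d_a0, N);
kernel<<<grid, block, 0, stream1>>>(d_a1, N);
cudaMemcpyAsync(h_a0, d_a0, N * sizeof(float), cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(h_a1, d_a1, N * sizeof(float), cudaMemcpyDeviceToHost, stream1);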
Actually, there is no contradiction between the "CUDA By Example" book and the CUDA C Programming Guide, since the discussion in the book was carried out with particular reference to a GTX 285 card, while, as already pointed out by Eric and in the quoted blog post How to Overlap Data Transfers in CUDA C/C++, concurrency is achieved differently on different architectures, depending on the dependencies involved and on the number of copy engines available.
For example, the blog considers two cards: the C1060 and the C2050. The former has one kernel engine and one copy engine, which can issue only one memory transaction (H2D or D2H) at a time. The latter has one kernel engine and two copy engines, which can issue two memory transactions (one H2D and one D2H) simultaneously. For the C1060, having only one copy engine, the outcome is the following:
CASE A - C1060 - NO CONCURRENCY ACHIEVED
Stream    Kernel engine       Copy engine           Comment
stream0   -                   memcpyHostToDevice    -
stream0   kernel execution    -                     depends on previous memcpy
stream0   -                   memcpyDeviceToHost    depends on previous kernel
stream1   -                   memcpyHostToDevice    -
stream1   kernel execution    -                     depends on previous memcpy
stream1   -                   memcpyDeviceToHost    depends on previous kernel
CASE B - C1060 - CONCURRENCY ACHIEVED
Stream      Kernel engine        Copy engine
stream0     -                    memcpyHostToDevice 0
stream0/1   kernel execution 0   memcpyHostToDevice 1
stream0/1   kernel execution 1   memcpyDeviceToHost 0
stream1     -                    memcpyDeviceToHost 1
Concerning the C2050 and considering the case of 3 streams, in CASE A concurrency is now achieved, in contrast to the C1060.
CASE A - C2050 - CONCURRENCY ACHIEVED
Stream        Kernel engine        Copy engine H2D         Copy engine D2H
stream0       -                    memcpyHostToDevice 0    -
stream0/1     kernel execution 0   memcpyHostToDevice 1    -
stream0/1/2   kernel execution 1   memcpyHostToDevice 2    memcpyDeviceToHost 0
stream0/1/2   kernel execution 2   -                       memcpyDeviceToHost 1
stream2       -                    -                       memcpyDeviceToHost 2

Related

Cuda Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48 GB vRAM) in parallel. The issue I face is that, for the simple block of code shown below, I would expect the overall block to complete in the same time regardless of the presence of the Device 2 code, since the copies run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of the same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I actually run the block, the Device 1 code alone takes 80 ms, and adding the Device 2 code adds another 20 ms, bringing the execution time for the block to 100 ms. I profiled the above code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD transfer of Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening, because as far as I'm concerned these transfers run independently, on different GPUs.
I realise that I haven't created any separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
I tried profiling the code and expected the execution durations to be independent across GPUs, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU, as it now has to handle additional transfers...?
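One way to test the separate-CPU-threads idea mentioned above is a sketch like the following (the helper name issueCopies is hypothetical); each std::thread issues its own device's async copy and synchronizes independently, which rules out serialization on the single issuing host thread:

#include <cuda_runtime.h>
#include <thread>

// Hypothetical helper: issue one device's async copy on its own CPU thread.
// cudaSetDevice is per-thread, so each thread selects its device independently.
void issueCopies(int device, float* dst, const float* pinnedSrc,
                 size_t n, cudaStream_t stream) {
    cudaSetDevice(device);
    cudaMemcpyAsync(dst, pinnedSrc, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
}

// Usage sketch (aDest/aHost/cDest/cHost/N/stream1/stream2 as in the question):
// std::thread t0(issueCopies, 0, aDest, aHost, N, stream1);
// std::thread t1(issueCopies, 1, cDest, cHost, N, stream2);
// t0.join(); t1.join();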

Using tbb::flow::graph with an embarrassingly parallel portion

I'm new to using tbb::flow and was looking to create a graph that has a portion which is embarrassingly parallel. The idea is to have a message come in to a node that does some pre-processing and can then formulate a set of tasks to be executed in parallel. The data is then aggregated in a multifunction_node that sends the results out to a couple of places.
                   --------------                 ------
                  |parallel nodes|               |Output|
      ----------- /--------------\ ----------   / ------
msg->|pre-process|-|parallel nodes|-|aggregator|
      ----------- \--------------/ ----------   \ ------
                  |parallel nodes|               |Output|
                   --------------                 ------
Now the aggregator can't send out its result until all the work is done, so it would need to keep track of the number of answers expected. Can I do this with a tbb::flow::graph, or should I create a function_node that has a parallel_for embedded in it? Other ideas or options?
If I can do it with tbb::flow what are the node types and queueing strategies?
Another way of thinking about this is it is a MapReduce kind of operation with a little preprocessing and the results being sent to a few different places in slightly different forms.
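A minimal sketch of the graph shape with tbb::flow (node names, the item count N, and the counting scheme in the aggregator are illustrative; the counter assumes one message is in flight at a time, and a second Output sink could be attached to the same port to fan the result out):

#include <tbb/flow_graph.h>
#include <atomic>
#include <iostream>

int main() {
    using namespace tbb::flow;
    graph g;
    const int N = 8;   // number of parallel work items produced per message

    // Pre-process: fan one incoming message out into N independent tasks.
    multifunction_node<int, std::tuple<int>> preprocess(
        g, serial,
        [N](int msg, multifunction_node<int, std::tuple<int>>::output_ports_type& out) {
            for (int i = 0; i < N; ++i)
                std::get<0>(out).try_put(msg * 100 + i);
        });

    // Embarrassingly parallel stage: 'unlimited' lets TBB run as many
    // copies of the body concurrently as there are worker threads.
    function_node<int, int> parallel_work(g, unlimited, [](int item) {
        return item * 2;   // stand-in for the real computation
    });

    // Aggregator: serial, counts partial answers and emits once all N arrived.
    std::atomic<int> seen{0};
    std::atomic<long> sum{0};
    multifunction_node<int, std::tuple<long>> aggregate(
        g, serial,
        [&](int partial, multifunction_node<int, std::tuple<long>>::output_ports_type& out) {
            sum += partial;
            if (++seen == N)
                std::get<0>(out).try_put(sum.load());
        });

    function_node<long, continue_msg> sink(g, serial, [](long total) {
        std::cout << "total = " << total << "\n";
        return continue_msg{};
    });

    make_edge(output_port<0>(preprocess), parallel_work);
    make_edge(parallel_work, aggregate);
    make_edge(output_port<0>(aggregate), sink);

    preprocess.try_put(1);   // inject one message
    g.wait_for_all();
}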

Profiling OpenCV using OProfile

I have this basic OpenCV program:
#include <iostream>
#include "opencv2/opencv.hpp"

int main() {
    std::cout << "Reading Image..." << std::endl;
    cv::Mat img = cv::imread("all_souls_000000.jpg", cv::IMREAD_GRAYSCALE);
    if (!img.data)
        std::cerr << "Error reading image" << std::endl;
    return 0;
}
Which creates the executable ReadImage. I want to profile it using OProfile. However, running:
operf ./ReadImage > ReadImage.log
Returns:
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Profiler started
* * * * WARNING: Profiling rate was throttled back by the kernel * * * *
The number of samples actually recorded is less than expected, but is
probably still statistically valid. Decreasing the sampling rate is the
best option if you want to avoid throttling.
Profiling done.
Why does this happen? What is the best way to profile OpenCV?
I was able to run operf on an OpenCV app with the following result. Is this what you are looking for?
Profiling started at Tue Jan 31 16:52:48 2017
Profiling stopped at Tue Jan 31 16:52:53 2017
-- OProfile/operf Statistics --
Nr. non-backtrace samples: 337018
Nr. kernel samples: 5603
Nr. user space samples: 331415
Nr. samples lost due to sample address not in expected range for domain: 0
Nr. lost kernel samples: 0
Nr. samples lost due to sample file open failure: 0
Nr. samples lost due to no permanent mapping: 0
Nr. user context kernel samples lost due to no app info available: 0
Nr. user samples lost due to no app info available: 0
Nr. backtraces skipped due to no file mapping: 0
Nr. hypervisor samples dropped due to address out-of-range: 0
Nr. samples lost reported by perf_events kernel: 0

How to find the task which core dumped?

Given a core dump from, say, a router/switch running VxWorks, how do I find the task which core dumped?
My guess is to list the tasks and see which one is in the running state. Is that right? Please let me know.
You can use coreDumpShow() from coreDumpLib.h (VxWorks 6.9).
-> coreDumpShow 0,1
NAME IDX VALID ERRNO SIZE CKSUM TYPE TASK
----------- --- ----- ---------- ---------- ----- ------------ ----------
vxcore1.z 1 Y N/A 0x000ef05b OK KERNEL_TASK t1
Core Dump detailed information:
-------------------------------
Time: THU JAN 01 00:01:42 1970 (ticks = 6174)
Task: "t1" (0x611e0a20)
Process: "(Kernel)" (0x6017a500)
Description: fatal kernel task-level exception!
Exception number: 0xb
Program counter: 0x6003823e
Stack pointer: 0x604d3da8
Frame pointer: 0x604d3fb0
value = 0 = 0x0
You might also want to try looking in coreDumpUtilLib:
coreDumpIsAvailable( ) - is a kernel core dump available for retrieval
coreDumpNextGet( ) - get the next kernel core dump on device
coreDumpInfoGet( ) - get information on a kernel core dump
coreDumpOpen( ) - open an existing kernel core dump for retrieval
coreDumpClose( ) - close a kernel core dump
coreDumpRead( ) - read from a kernel core dump file
coreDumpCopy( ) - copy a kernel core dump to the given path

Writing multiple files slows down after x seconds

I have code which gets frames from a camera and then saves them to disk. The structure of the code is: multiple threads malloc and copy their frames into new memory and enqueue that memory; finally, another thread removes frames from the queue and writes them (using the ffmpeg API, raw video, no compression) to their files. (Actually I'm using my own memory pool, so malloc is only called when I need more buffers.) I can have up to 8 files/cams open and enqueueing at the same time.
The problem is that for the first 45 seconds everything works fine: there's never more than one frame in the queue. But after that my queue gets backed up; processing takes just a few ms longer, resulting in increased RAM usage, because I cannot save the frames fast enough and I have to malloc more memory to store them.
I have an 8-core, 16 GB RAM, Windows 7 64-bit computer (NTFS, lots of free space on the second disk drive). The disk is supposed to be able to write up to 6 Gbit/s. To save my data in time I need to be able to write at 50 MB/s. I tested the disk speed with "PassMark PerformanceTest": 8 threads writing files simultaneously, exactly like ffmpeg saves files (synchronized, uncached I/O), achieved 100 MB/s. So why aren't my writes able to achieve that?
Here's how the ffmpeg writes look in the Process Monitor logs:
Time of Day Operation File# Result Detail
2:30:32.8759350 PM WriteFile 8 SUCCESS Offset: 749,535,120, Length: 32,768
2:30:32.8759539 PM WriteFile 8 SUCCESS Offset: 749,567,888, Length: 32,768
2:30:32.8759749 PM WriteFile 8 SUCCESS Offset: 749,600,656, Length: 32,768
2:30:32.8759939 PM WriteFile 8 SUCCESS Offset: 749,633,424, Length: 32,768
2:30:32.8760314 PM WriteFile 8 SUCCESS Offset: 749,666,192, Length: 32,768
2:30:32.8760557 PM WriteFile 8 SUCCESS Offset: 749,698,960, Length: 32,768
2:30:32.8760866 PM WriteFile 8 SUCCESS Offset: 749,731,728, Length: 32,768
2:30:32.8761259 PM WriteFile 8 SUCCESS Offset: 749,764,496, Length: 32,768
2:30:32.8761452 PM WriteFile 8 SUCCESS Offset: 749,797,264, Length: 32,768
2:30:32.8761629 PM WriteFile 8 SUCCESS Offset: 749,830,032, Length: 32,768
2:30:32.8761803 PM WriteFile 8 SUCCESS Offset: 749,862,800, Length: 32,768
2:30:32.8761977 PM WriteFile 8 SUCCESS Offset: 749,895,568, Length: 32,768
2:30:32.8762235 PM WriteFile 8 SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal
2:30:32.8762973 PM WriteFile 8 SUCCESS Offset: 749,961,104, Length: 32,768
2:30:32.8763160 PM WriteFile 8 SUCCESS Offset: 749,993,872, Length: 32,768
2:30:32.8763352 PM WriteFile 8 SUCCESS Offset: 750,026,640, Length: 32,768
2:30:32.8763502 PM WriteFile 8 SUCCESS Offset: 750,059,408, Length: 32,768
2:30:32.8763649 PM WriteFile 8 SUCCESS Offset: 750,092,176, Length: 32,768
2:30:32.8763790 PM WriteFile 8 SUCCESS Offset: 750,124,944, Length: 32,768
2:30:32.8763955 PM WriteFile 8 SUCCESS Offset: 750,157,712, Length: 32,768
2:30:32.8764072 PM WriteFile 8 SUCCESS Offset: 750,190,480, Length: 4,104
2:30:32.8848241 PM WriteFile 4 SUCCESS Offset: 750,194,584, Length: 32,768
2:30:32.8848481 PM WriteFile 4 SUCCESS Offset: 750,227,352, Length: 32,768
2:30:32.8848749 PM ReadFile 4 END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
2:30:32.8848989 PM WriteFile 4 SUCCESS Offset: 750,260,120, Length: 32,768
2:30:32.8849157 PM WriteFile 4 SUCCESS Offset: 750,292,888, Length: 32,768
2:30:32.8849319 PM WriteFile 4 SUCCESS Offset: 750,325,656, Length: 32,768
2:30:32.8849475 PM WriteFile 4 SUCCESS Offset: 750,358,424, Length: 32,768
2:30:32.8849637 PM WriteFile 4 SUCCESS Offset: 750,391,192, Length: 32,768
2:30:32.8849880 PM WriteFile 4 SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal
2:30:32.8850400 PM WriteFile 4 SUCCESS Offset: 750,456,728, Length: 32,768
2:30:32.8850727 PM WriteFile 4 SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal
This looks very efficient; however, according to DiskMon the actual disk writes are ridiculously fragmented, writing back and forth, which may account for the slow speed. See the write speed according to this data (~5 MB/s) below.
Time   Write duration   Sector       Length   MB/sec
95.6 0.00208855 1490439632 896 0.409131784
95.6 0.00208855 1488197000 128 0.058447398
95.6 0.00009537 1482323640 128 1.279965529
95.6 0.00009537 1482336312 768 7.679793174
95.6 0.00009537 1482343992 384 3.839896587
95.6 0.00009537 1482350648 768 7.679793174
95.6 0.00039101 1489278984 1152 2.809730729
95.6 0.00039101 1489393672 896 2.185346123
95.6 0.0001812 1482349368 256 1.347354443
95.6 0.0001812 1482358328 896 4.715740549
95.6 0.0001812 1482370616 640 3.368386107
95.6 0.0001812 1482378040 256 1.347354443
95.6 0.00208855 1488197128 384 0.175342193
95.6 0.00208855 1488202512 640 0.292236989
95.6 0.00208855 1488210320 1024 0.467579182
95.6 0.00009537 1482351416 256 2.559931058
95.6 0.00009537 1482360120 896 8.959758703
95.6 0.00009537 1482371896 640 6.399827645
95.6 0.00009537 1482380088 256 2.559931058
95.7 0.00039101 1489394568 1152 2.809730729
95.7 0.00039101 1489396744 352 0.858528834
95.7 0.00039101 1489507944 544 1.326817289
95.7 0.0001812 1482378296 768 4.042063328
95.7 0.0001812 1482392120 768 4.042063328
95.7 0.0001812 1482400568 512 2.694708885
95.7 0.00208855 1488224144 768 0.350684386
95.7 0.00208855 1488232208 384 0.175342193
I'm pretty confident it's not my code, because I timed everything; for example, enqueueing takes a few µs, suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is: how can I improve my disk writes, and what can I do to profile the actual disk writes (remember that I rely on the FFmpeg DLLs to save, so I cannot access the low-level writing functions directly)? If I cannot figure it out, I'll dump all the frames into a single sequential binary file (which should increase I/O speed) and then split it into video files in post-processing.
I don't know how much my disk I/O is cached (CacheSet only shows the disk C cache size), but the following image from the Performance Monitor, taken at 0 and 45 seconds into the video (just before my queue starts piling up), looks weird to me. Basically, the modified set and the standby set grew from very little to this large value. Is that the data being cached? Is it possible that data only starts being written to disk at 45 seconds, so that suddenly everything slows down?
(FYI, LabVIEW is the program that loads my dll.)
I'll appreciate any help.
M.
With CreateFile it looks like you want one or both of these parameters:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx
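A minimal sketch of opening a file with those flags (the path is hypothetical); note that with FILE_FLAG_NO_BUFFERING the buffer address, the transfer size, and the file offset must all be multiples of the volume sector size:

#include <windows.h>

// Sketch: open a file for unbuffered, write-through output.
HANDLE OpenUnbuffered(const wchar_t* path) {
    return CreateFileW(path,
                       GENERIC_WRITE,
                       0,                 // no sharing
                       nullptr,           // default security
                       CREATE_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       nullptr);          // no template file
}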
Your delayed performance hit occurs when the OS starts pushing data to the disk.
6 Gb/s is the performance capability of the SATA 3 bus, not of the actual devices connected, nor of the physical platters or flash RAM underneath.
A common problem with AV systems is that a constant high-rate stream of writes can get periodically interrupted by disk overhead tasks. There used to be special AV disks you could purchase that don't do this; these days you can purchase disks with special high-throughput firmware explicitly meant for security video recording.
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en
The problem is the repeated malloc and free, which puts a load on the system. I suggest creating a buffer pool, i.e. allocating N buffers in the initialization stage and reusing them instead of mallocing and freeing the memory. Since you have mentioned ffmpeg, to give an example from multimedia: in GStreamer, buffer management occurs in the form of buffer pools, and in a GStreamer pipeline buffers are usually taken from and passed around between buffer pools. Most multimedia systems do this.
Regarding:
The problem is that for the first 45 seconds everything works fine: there's never more than one frame in the queue. But after that my queue gets backed up; processing takes just a few ms longer, resulting in increased RAM usage, because I cannot save the frames fast enough and I have to malloc more memory to store them.
The application is thrashing at this point, and calling malloc will only make matters worse. I suggest implementing a producer-consumer model where one side waits as appropriate. In your case, set up a threshold of N buffers: if there are N buffers in the queue, new frames from the camera are not enqueued until the existing buffers have been processed.
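A minimal sketch of such a bounded queue (the capacity, the frame type, and the blocking policy are illustrative; the producer could also drop frames instead of blocking):

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

// Bounded frame queue: producers block once 'capacity' buffers are in
// flight, so memory use stays constant instead of growing without bound.
class BoundedFrameQueue {
public:
    explicit BoundedFrameQueue(size_t capacity) : capacity_(capacity) {}

    void push(std::vector<uint8_t> frame) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(std::move(frame));
        notEmpty_.notify_one();
    }

    std::vector<uint8_t> pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        std::vector<uint8_t> frame = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return frame;
    }

private:
    size_t capacity_;
    std::queue<std::vector<uint8_t>> q_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};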
Another idea: instead of writing raw frames, why not write encoded data? Assuming you want video, you can at least write an elementary H.264 stream (and ffmpeg comes with a good H.264 encoder!), or even better, if you have access to an MPEG-4 muxer, an mp4 file. This will reduce the memory requirements and the I/O load dramatically.
I am copying a directory with 100,000+ files to an exFAT disk. The copy starts at around 60 MB/s and degrades to about 1-2 MB/s. For me the quick & dirty fix is to put the computer to sleep during the operation and then wake it up; the original speed returns instantly. Not very sophisticated, but it works for me.
I think it's ultimately a consequence of using an "unwise" write mode.
Why does it work with 8 threads writing in your synthetic test? Well, because in that case, little does it matter if a thread blocks. You're pushing data towards the drive full throttle. It's not surprising that you get 100 MB/s that way, which is roughly the reported real-life speed of the drive that you named.
What happens in the "real" program? You have 8 producers and a single consumer that pushes data to disk. You use non-buffered, synchronized writes. Which means as much as for as long as it takes for the drive to receive all data, your writer thread blocks, it doesn't do anything. Producers keep producing. Well that is fine, says you. You only need 50 MB/s or such, and the drive can do 100 MB/s.
Yes, except... that's the maximum with several threads hammering the drive concurrently. You have zero concurrency here, so you don't get those 100 MB/s already. Or maybe you do, for a while.
And then, the drive needs to do something. Something? Something, anything, whatever, say, a seek. Or you get scheduled out for an instant, or for whatever reason the controller or OS doesn't push data for a tenth of a millisecond, and when you look again the disk has spun on, so you need to wait one rotation (no joke, that happens!). Maybe there are just one or two sectors in the way on a fragmented disk, who knows. It doesn't matter what... something happens.
This takes, say, anywhere from one to ten milliseconds, and your thread does nothing during that time, it's still waiting for the blocking WriteFile call to finish already. Producers keep producing and keep pulling memory blocks from the pool. The pool runs out of memory blocks, and allocates some more. Producers keep producing, and producing. Meanwhile, your writer thread keeps the buffer locked, and still does nothing.
See where this is going? That cannot be sustained forever. Even if it temporarily recovers, it's always fighting uphill. And losing.
Caching (also write caching) doesn't exist for no reason. Things in the real world do not always go as fast as we wish, both in terms of bandwidth and latency.
On the internet, we can easily crank up bandwidth to gigabit uplinks, but we are bound by latency, and there is absolutely nothing that we can do about it. That's why trans-atlantic communications still suck like they did 30 years ago, and they will still suck in 30 years (the speed of light very likely won't change).
Inside a computer, things are very similar. You can easily push two-digit gigabytes per second over a PCIe bus, but doing so in the first place takes a painfully long time. Graphics programmers know that only too well.
The same applies to disks, and it is no surprise that exactly the same strategy is applied to solve the issue: Overlap transfers, do them asynchronously. That's what virtually every operating system does by default. Data is copied to buffers, and written out lazily, and asynchronously.
That, however, is exactly what you disable by using unbuffered writes. Which is almost always (with very, very few singular exceptions) a bad idea.
For some reason, people have kept around this mindset of DMA doing magic things and being totally superior speed-wise. Which, maybe, was even true at some point in the distant past, and maybe still is true for some very rare corner cases (e.g. read-once from optical disks). In all other cases, it's a disastrous anti-optimization.
Things that could run in parallel run sequentially. Latencies add up. Threads block, and do nothing.
Simply using "normal" writes means data is copied to buffers, and WriteFile returns almost instantly (well, with a delay equivalent to a memcpy). The drive never (well, hopefully, can never be 100% sure) starves, and your writer thread never blocks, doing nothing. Memory blocks don't stack up in the queue, and pools don't run out, they don't need to allocate more.