I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090, 24 GB VRAM each, 48 GB total) working in parallel. The issue I face is that, for the simple block of code shown below, I would expect the block to take the same amount of time whether or not the Device 2 code is present, since the copies run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned host memory. All arrays are of the same length N.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I actually run the block, the Device 1 code alone takes 80 ms, while adding the Device 2 code adds another 20 ms, bringing the block's execution time to 100 ms. I profiled the code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone, though, I get the profile below:
Device 1 alone (image)
I can see that the initial HtoD transfer on Device 1 takes longer when I add Device 2, and I'm not sure why this is happening because, as far as I can tell, these transfers run independently on different GPUs.
I realise that I haven't created separate CPU threads to handle the separate devices, but I'm not sure whether that would help. Could someone please help me understand why this slowdown happens when I add the Device 2 code?
EDIT:
I tried profiling the code and expected the execution durations to be independent per GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps adding Device 2 puts more stress on the host side, since it now has to drive additional transfers...?
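For reference, here is a minimal sketch of the separate-host-threads variant I could try (purely illustrative, not my actual code; it assumes the streams, device buffers and pinned host buffers are set up exactly as in the snippet above, and issueDevice0/issueDevice1 are just placeholder names). Even with separate threads, I suppose both copies still share the same host memory and PCIe root complex:

// Hypothetical sketch: issue each device's copies from its own host thread.
// The current CUDA device is a per-thread setting, so each thread calls cudaSetDevice itself.
#include <thread>
#include <cuda_runtime.h>

void issueDevice0(float* aDest, float* bDest, const float* aHost, const float* bHost,
                  size_t N, cudaStream_t stream1) {
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}

void issueDevice1(float* cDest, const float* cHost, size_t N, cudaStream_t stream2) {
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);
    cudaStreamSynchronize(stream2);
}

// Usage (one thread per device, joined each iteration):
// std::thread t0(issueDevice0, aDest, bDest, aHost, bHost, N, stream1);
// std::thread t1(issueDevice1, cDest, cHost, N, stream2);
// t0.join(); t1.join();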
I have code which gets frames from cameras and then saves them to disk. The structure of the code is: multiple threads malloc and copy their frames into new memory and enqueue that memory; another thread then removes frames from the queue and writes them (using the ffmpeg API, raw video, no compression) to their files. (Actually I'm using my own memory pool, so malloc is only called when I need more buffers.) I can have up to 8 files/cams open and enqueuing at the same time.
The problem is that for the first 45 seconds everything works fine: there's never more than one frame in the queue. But after that my queue gets backed up, processing takes just a few ms longer, and RAM usage grows because I cannot save the frames fast enough, so I have to malloc more memory to store them.
I have an 8-core, 16 GB RAM, Windows 7 64-bit computer (NTFS, lots of free space on the second disk drive). The disk is supposed to be able to write at up to 6 Gbit/s. To save my data in time I need to be able to write at 50 MB/s. I tested disk speed with "PassMark PerformanceTest": with 8 threads writing files simultaneously, exactly the way ffmpeg saves files (synchronized, uncached I/O), it was able to achieve 100 MB/s. So why aren't my writes able to achieve that?
Here's how the ffmpeg writes look from Process monitor logs:
Time of Day Operation File# Result Detail
2:30:32.8759350 PM WriteFile 8 SUCCESS Offset: 749,535,120, Length: 32,768
2:30:32.8759539 PM WriteFile 8 SUCCESS Offset: 749,567,888, Length: 32,768
2:30:32.8759749 PM WriteFile 8 SUCCESS Offset: 749,600,656, Length: 32,768
2:30:32.8759939 PM WriteFile 8 SUCCESS Offset: 749,633,424, Length: 32,768
2:30:32.8760314 PM WriteFile 8 SUCCESS Offset: 749,666,192, Length: 32,768
2:30:32.8760557 PM WriteFile 8 SUCCESS Offset: 749,698,960, Length: 32,768
2:30:32.8760866 PM WriteFile 8 SUCCESS Offset: 749,731,728, Length: 32,768
2:30:32.8761259 PM WriteFile 8 SUCCESS Offset: 749,764,496, Length: 32,768
2:30:32.8761452 PM WriteFile 8 SUCCESS Offset: 749,797,264, Length: 32,768
2:30:32.8761629 PM WriteFile 8 SUCCESS Offset: 749,830,032, Length: 32,768
2:30:32.8761803 PM WriteFile 8 SUCCESS Offset: 749,862,800, Length: 32,768
2:30:32.8761977 PM WriteFile 8 SUCCESS Offset: 749,895,568, Length: 32,768
2:30:32.8762235 PM WriteFile 8 SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal
2:30:32.8762973 PM WriteFile 8 SUCCESS Offset: 749,961,104, Length: 32,768
2:30:32.8763160 PM WriteFile 8 SUCCESS Offset: 749,993,872, Length: 32,768
2:30:32.8763352 PM WriteFile 8 SUCCESS Offset: 750,026,640, Length: 32,768
2:30:32.8763502 PM WriteFile 8 SUCCESS Offset: 750,059,408, Length: 32,768
2:30:32.8763649 PM WriteFile 8 SUCCESS Offset: 750,092,176, Length: 32,768
2:30:32.8763790 PM WriteFile 8 SUCCESS Offset: 750,124,944, Length: 32,768
2:30:32.8763955 PM WriteFile 8 SUCCESS Offset: 750,157,712, Length: 32,768
2:30:32.8764072 PM WriteFile 8 SUCCESS Offset: 750,190,480, Length: 4,104
2:30:32.8848241 PM WriteFile 4 SUCCESS Offset: 750,194,584, Length: 32,768
2:30:32.8848481 PM WriteFile 4 SUCCESS Offset: 750,227,352, Length: 32,768
2:30:32.8848749 PM ReadFile 4 END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
2:30:32.8848989 PM WriteFile 4 SUCCESS Offset: 750,260,120, Length: 32,768
2:30:32.8849157 PM WriteFile 4 SUCCESS Offset: 750,292,888, Length: 32,768
2:30:32.8849319 PM WriteFile 4 SUCCESS Offset: 750,325,656, Length: 32,768
2:30:32.8849475 PM WriteFile 4 SUCCESS Offset: 750,358,424, Length: 32,768
2:30:32.8849637 PM WriteFile 4 SUCCESS Offset: 750,391,192, Length: 32,768
2:30:32.8849880 PM WriteFile 4 SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal
2:30:32.8850400 PM WriteFile 4 SUCCESS Offset: 750,456,728, Length: 32,768
2:30:32.8850727 PM WriteFile 4 SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal
This looks very efficient; however, according to DiskMon the actual disk writes are ridiculously fragmented, jumping back and forth, which may account for the slow speed. See the graph of the write speed according to this data (~5 MB/s).
Time Write duration Sector Length MB/sec
95.6 0.00208855 1490439632 896 0.409131784
95.6 0.00208855 1488197000 128 0.058447398
95.6 0.00009537 1482323640 128 1.279965529
95.6 0.00009537 1482336312 768 7.679793174
95.6 0.00009537 1482343992 384 3.839896587
95.6 0.00009537 1482350648 768 7.679793174
95.6 0.00039101 1489278984 1152 2.809730729
95.6 0.00039101 1489393672 896 2.185346123
95.6 0.0001812 1482349368 256 1.347354443
95.6 0.0001812 1482358328 896 4.715740549
95.6 0.0001812 1482370616 640 3.368386107
95.6 0.0001812 1482378040 256 1.347354443
95.6 0.00208855 1488197128 384 0.175342193
95.6 0.00208855 1488202512 640 0.292236989
95.6 0.00208855 1488210320 1024 0.467579182
95.6 0.00009537 1482351416 256 2.559931058
95.6 0.00009537 1482360120 896 8.959758703
95.6 0.00009537 1482371896 640 6.399827645
95.6 0.00009537 1482380088 256 2.559931058
95.7 0.00039101 1489394568 1152 2.809730729
95.7 0.00039101 1489396744 352 0.858528834
95.7 0.00039101 1489507944 544 1.326817289
95.7 0.0001812 1482378296 768 4.042063328
95.7 0.0001812 1482392120 768 4.042063328
95.7 0.0001812 1482400568 512 2.694708885
95.7 0.00208855 1488224144 768 0.350684386
95.7 0.00208855 1488232208 384 0.175342193
I'm pretty confident it's not my code, because I timed everything; for example, enqueuing takes a few microseconds, suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is: how can I improve my disk writes, and what can I do to profile the actual disk writes (remember that I rely on the FFmpeg DLLs to save, so I cannot access the low-level writing functions directly)? If I cannot figure it out, I'll dump all the frames into a single sequential binary file (which should increase I/O speed) and then split it into video files in post-processing.
I don't know how much my disk I/O is being cached (CacheSet only shows the disk C cache size), but the following image from Performance Monitor, taken at 0 and 45 seconds into the video (just before my queue starts piling up), looks weird to me. Basically, the modified and standby sets grew from very little to these large values. Is that the data being cached? Is it possible that data only starts being written to disk at 45 seconds, so suddenly everything slows down?
(FYI, LabVIEW is the program that loads my dll.)
I'll appreciate any help.
M.
With CreateFile it looks like you want one or both of these flags (see the sketch after the link below):
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx
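For example, something along these lines when opening the capture file (illustrative sketch only, not the asker's code; the function name and path are made up):

// Open an output file with unbuffered, write-through I/O. Note that with
// FILE_FLAG_NO_BUFFERING every WriteFile must use a sector-aligned buffer and a
// length that is a multiple of the volume's sector size.
#include <windows.h>

HANDLE openUnbufferedWriteThrough(const char* path) {
    return CreateFileA(
        path,
        GENERIC_WRITE,
        0,                     // no sharing
        NULL,                  // default security attributes
        CREATE_ALWAYS,
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
        NULL);                 // no template file
}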
Your delayed performance hit occurs when the OS starts pushing data to the disk.
6 Gb/s is the performance capability of the SATA 3 bus, not of the actual devices connected or of the physical platters or flash underneath.
A common problem with AV systems is that constantly writing a high-rate stream of data can get periodically interrupted by disk housekeeping tasks. There used to be special AV disks you could purchase that don't do this; these days you can purchase disks with special high-throughput firmware intended explicitly for security video recording.
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en
The problem is with the repeated malloc and free, which puts a load on the system. I suggest creating a buffer pool, i.e. allocate N buffers in the initialization stage and reuse them instead of mallocing and freeing the memory. Since you have mentioned ffmpeg, to give an example from multimedia: in GStreamer, buffer management happens via buffer pools, and in a GStreamer pipeline buffers are usually taken from and passed around between buffer pools. Most multimedia systems do this.
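Something along these lines (a minimal illustrative sketch; class and method names are made up, and fixed-size frame buffers are assumed):

// Minimal buffer pool: N fixed-size buffers allocated once and reused.
#include <cstddef>
#include <mutex>
#include <vector>

class FramePool {
public:
    FramePool(std::size_t bufferSize, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(new unsigned char[bufferSize]);
    }
    ~FramePool() {                                    // assumes all buffers were released
        for (unsigned char* b : free_) delete[] b;
    }
    // Returns a free buffer, or nullptr if the pool is exhausted.
    unsigned char* acquire() {
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return nullptr;
        unsigned char* b = free_.back();
        free_.pop_back();
        return b;
    }
    // Returns a buffer to the pool once the writer thread is done with it.
    void release(unsigned char* b) {
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(b);
    }
private:
    std::mutex m_;
    std::vector<unsigned char*> free_;
};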
Regarding:
The problem is that for the first 45 seconds everything works fine: there's never more than one frame in the queue. But after that my queue gets backed up, processing takes just a few ms longer, and RAM usage grows because I cannot save the frames fast enough, so I have to malloc more memory to store them.
The application is thrashing at this point. Calling malloc at this point will only make matters worse. I suggest implementing a producer-consumer model, where one side waits on the other depending on the situation. In your case, set up a threshold of N buffers: if there are already N buffers in the queue, new frames from the camera are not enqueued until the existing buffers have been processed.
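A rough sketch of such a bounded queue (illustrative only; names are made up):

// Bounded queue: producers block once 'capacity' frames are pending, so camera
// threads stop enqueuing until the writer thread catches up.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename Frame>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(Frame f) {                       // called by camera threads
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push_back(std::move(f));
        notEmpty_.notify_one();
    }

    Frame pop() {                              // called by the writer thread
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        Frame f = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return f;
    }

private:
    std::size_t capacity_;
    std::deque<Frame> q_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};

Blocking the producers when the queue is full means frames may be dropped on the camera side while the writer catches up; that is the trade-off being proposed here instead of unbounded memory growth.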
Another idea: instead of writing raw frames, why not write encoded data? Assuming you want video, you can at least write an elementary H.264 stream (and ffmpeg comes with a good H.264 encoder!), or even better, if you have access to an MPEG-4 muxer, an .mp4 file. This will reduce the memory requirements and the I/O load dramatically.
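A rough sketch of what the encoding side could look like with libavcodec (this uses the newer send/receive API, which may differ from the ffmpeg build the asker has; resolution and frame rate are assumed values, error handling is omitted, and conversion of camera frames to YUV420P is assumed to happen elsewhere):

// Encode frames to a raw H.264 elementary stream and append packets to a file.
extern "C" {
#include <libavcodec/avcodec.h>
}
#include <cstdio>

AVCodecContext* openH264Encoder() {
    const AVCodec* codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    AVCodecContext* ctx = avcodec_alloc_context3(codec);
    ctx->width     = 1920;                 // assumed camera resolution
    ctx->height    = 1080;
    ctx->time_base = AVRational{1, 30};    // assumed 30 fps
    ctx->pix_fmt   = AV_PIX_FMT_YUV420P;
    avcodec_open2(ctx, codec, nullptr);
    return ctx;
}

void encodeAndWrite(AVCodecContext* ctx, AVFrame* frame, AVPacket* pkt, std::FILE* out) {
    avcodec_send_frame(ctx, frame);                  // pass nullptr to flush at the end
    while (avcodec_receive_packet(ctx, pkt) == 0) {  // drain whatever the encoder has ready
        std::fwrite(pkt->data, 1, pkt->size, out);
        av_packet_unref(pkt);
    }
}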
I am copying a directory with 100,000+ files to an exFAT disk. The copy starts at around 60 MB/s and degrades to about 1-2 MB/s. For me the quick & dirty fix is to put the computer to sleep during the operation and then wake it up. The original speed returns instantly. Not very sophisticated, but it works for me.
I think it's ultimately a consequence of using an "unwise" write mode.
Why does it work with 8 threads writing in your synthetic test? Well, because in that case, little does it matter if a thread blocks. You're pushing data towards the drive full throttle. It's not surprising that you get 100 MB/s that way, which is roughly the reported real-life speed of the drive that you named.
What happens in the "real" program? You have 8 producers and a single consumer that pushes data to disk. You use non-buffered, synchronized writes. Which means that for as long as it takes the drive to receive all the data, your writer thread blocks and does nothing. Producers keep producing. Well, that's fine, you say. You only need 50 MB/s or so, and the drive can do 100 MB/s.
Yes, except... that's the maximum with several threads hammering the drive concurrently. You have zero concurrency here, so you don't even get those 100 MB/s in the first place. Or maybe you do, for a while.
And then, the drive needs to do something. Something? Something, anything, whatever; say, a seek. Or you get scheduled out for an instant, or for whatever reason the controller or the OS doesn't push data for a tenth of a millisecond, and when you look again, the disk has spun on, so you need to wait a full rotation (no joke, that happens!). Maybe there's just one or two sectors in the way on a fragmented disk, who knows. It doesn't matter what... something happens.
This takes, say, anywhere from one to ten milliseconds, and your thread does nothing during that time, it's still waiting for the blocking WriteFile call to finish already. Producers keep producing and keep pulling memory blocks from the pool. The pool runs out of memory blocks, and allocates some more. Producers keep producing, and producing. Meanwhile, your writer thread keeps the buffer locked, and still does nothing.
See where this is going? That cannot be sustained forever. Even if it temporarily recovers, it's always fighting uphill. And losing.
Caching (also write caching) doesn't exist for no reason. Things in the real world do not always go as fast as we wish, both in terms of bandwidth and latency.
On the internet, we can easily crank up bandwidth to gigabit uplinks, but we are bound by latency, and there is absolutely nothing that we can do about it. That's why trans-atlantic communications still suck like they did 30 years ago, and they will still suck in 30 years (the speed of light very likely won't change).
Inside a computer, things are very similar. You can easily push two-digit gigabytes per second over a PCIe bus, but doing so in the first place takes a painfully long time. Graphics programmers know that only too well.
The same applies to disks, and it is no surprise that exactly the same strategy is applied to solve the issue: Overlap transfers, do them asynchronously. That's what virtually every operating system does by default. Data is copied to buffers, and written out lazily, and asynchronously.
That, however, is exactly what you disable by using unbuffered writes. Which is almost always (with very, very few singular exceptions) a bad idea.
For some reason, people have kept around this mindset of DMA doing magic things and being so totally superior speed-wise. Which, maybe, was even true at some point in the distant past, and maybe still is true for some very rare corner cases (e.g. read-once from optical disks). In all other cases, it's a disastrous anti-optimization.
Things that could run in parallel run sequentially. Latencies add up. Threads block, and do nothing.
Simply using "normal" writes means data is copied to buffers, and WriteFile returns almost instantly (well, with a delay equivalent to a memcpy). The drive never starves (well, hopefully; one can never be 100% sure), and your writer thread never sits there blocked, doing nothing. Memory blocks don't stack up in the queue, and pools don't run out and don't need to allocate more.
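To make the contrast concrete, a "normal" buffered open/write looks roughly like this (illustrative sketch only, not the asker's code; names are made up):

// Open the output file in the default, buffered mode and write a frame. Without
// FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, WriteFile copies the data
// into the system cache and returns almost immediately; the cache manager
// flushes it to the physical disk lazily, in the background.
#include <windows.h>

HANDLE openBuffered(const char* path) {
    return CreateFileA(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL,   // buffered "normal" writes
                       NULL);
}

bool writeFrame(HANDLE h, const void* frame, DWORD frameSize) {
    DWORD written = 0;
    // Returns as soon as the data is in the OS cache, so the writer thread
    // rarely sits blocked waiting on the physical disk.
    return WriteFile(h, frame, frameSize, &written, NULL) && written == frameSize;
}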