Memory-compute overlap issue in cuda

I have a CUDA kernel that processes a lot of data.
As I cannot transfer all the data at once, I have to split it into chunks, process them chunk by chunk, and update the output on the GPU.
I am parsing the input data from a file.
I was wondering whether I could overlap the chunks' memory transfers by having two buffers on both the host and the GPU. While processing one chunk, I could read the next one, transfer it to the GPU, and launch the kernel on the same stream.
My problem is that the kernel takes longer to execute than parsing the data and transferring it to the GPU. How can I ensure that the memcpys won't overwrite the data the kernel is using, given that the memcpys are non-blocking?
//e.g. Pseudocode
//for every chunk
//parse data
//cudaMemcpyAsync ( dev, host, size, H2D )
//launch kernel
//switch_buffer
//copy result from device to host
Thank you in advance.

Just insert an explicit sync point with cudaDeviceSynchronize() after the kernel launch.
That way, you are essentially starting a memory transfer and launching a kernel at the same time. The transfer would go to one buffer and the kernel would work on the other. The cudaDeviceSynchronize() would wait until both were done, at which time you would swap the buffers and repeat.
Of course, you also need to copy the results from the device to the host within the loop and add logic to handle the first iteration, when there's no data for the kernel to process yet, and the last iteration, when there's no more data to copy but still one buffer to be processed. This can be done with logic within the loop or by partially unrolling the loop, to specifically code the first and last iterations.
Edit:
By moving the sync point to just before the cudaMemcpyAsync() and after the file read and parse, you allow the kernel to also overlap that part of the processing (if the kernel runs long enough).
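The ping-pong structure described above can be sketched without a GPU. What follows is a host-only analogy in plain C++, not CUDA code: std::async stands in for the asynchronous kernel launch, copying into a std::vector stands in for the parse plus host-to-device transfer, and future.get() plays the role of the sync point. The name process_chunks and the summing "kernel" are made up for illustration.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical driver: processes every chunk with two alternating buffers.
long long process_chunks(const std::vector<std::vector<int>>& chunks) {
    std::vector<int> buffers[2];    // stand-ins for the two device buffers
    std::future<long long> kernel;  // stand-in for the kernel in flight
    long long result = 0;
    int fill = 0;                   // index of the buffer the next copy writes to

    for (const auto& chunk : chunks) {
        // "Parse and transfer" into the buffer the running kernel is NOT
        // reading. This overlaps with the previous kernel.
        buffers[fill] = chunk;

        // Sync point: the previous kernel must finish before the next launch,
        // which also frees its buffer for the next iteration's copy.
        if (kernel.valid()) result += kernel.get();

        const std::vector<int>* input = &buffers[fill];
        kernel = std::async(std::launch::async, [input] {
            return std::accumulate(input->begin(), input->end(), 0LL);
        });

        fill ^= 1;  // swap buffers
    }

    // Last iteration: nothing left to copy, but one buffer is still in flight.
    if (kernel.valid()) result += kernel.get();
    return result;
}
```

The copy in iteration k writes to the buffer last read by the kernel from iteration k-2, and that kernel was already waited on in iteration k-1, so the copy never touches data a running kernel is using.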

Related

Low performance when recording command buffers with multiple threads

So basically I've spent the past two weeks trying to figure out why my multithreaded command buffer recording has been so slow, and I'm completely stumped. The problem is that when I try to offload work to multiple threads to record my command buffers, I end up with lower performance than I would have if I had used a single thread. I am very sure that this is an issue with how I have set up my threads/jobs.
What I'm doing at the moment is this: when I begin preparing for the next frame, I tell X threads to record X draw commands in a secondary command buffer (one cmd buffer per thread). All of the work is spread evenly between the X threads that I'm using. However, after keeping track of the time it takes each thread to do its work, I noticed that the threads did not finish their jobs any faster when I included more worker threads to do a portion of the work. For example, telling 2 threads to record 100 draw commands each would finish in about 5ms, but if I tell 4 threads to record 50 draw commands each, they still finish in the same amount of time as the 2 threads, despite having more threads doing less work each.
I'm not really sure what I'm doing wrong here.
My job function looks like this
void doRender(CommandBuffer* writeBuffer, VkCommandBufferInheritanceInfo inheritance, Pipeline* pipelineIn, MeshBuffer* meshBufferIn, dragonsbreath4::World* worldIn, int start, int end) {
    writeBuffer->begin(VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT, &inheritance);
    writeBuffer->bindPipeline(pipelineIn);
    writeBuffer->bindMeshBuffer(meshBufferIn);
    writeBuffer->draw(worldIn->getEntityList(), start, end);
    writeBuffer->finish();
}
The writebuffer is a pointer to the secondary command buffer that will be recorded.
The pipeline is a pointer to the graphics pipeline used by the application.
The meshbuffer is a pointer to an object containing buffers with all of the vertex data and index data.
The world is a pointer to the world that will be drawn.
The start/end is just where on the entity list to begin drawing from and where to end.
I'm pretty sure the issue lies with the fact that most of these resources are accessed by every thread that is recording commands (pipeline, world, meshbuffer).
I hope this is enough information if not I can include more. Any help here would be appreciated!!!

Identify the reason for a 200 ms freeze in a time-critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors, and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread, and one buffer is used to read the data and save it to disc in the second thread. Once all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffers switch, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is that the performance problem is linked to the data being exchanged between cores. Only if the threads happen to run on the same core would the exchange be much faster. I thought about setting the thread affinity mask to force both threads to run on the same core, but this also means that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffers to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.

Locking and releasing a mutex just to switch one bool variable will not take 200 ms.
The main problem is probably that the two threads are blocking each other in some way.
This kind of blocking is called lock contention. It occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism, you have two threads waiting for each other to finish their part of the work, which has a similar effect to a single-threaded approach.
For further reading I recommend this article, which describes lock contention in more detail.
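If contention on the switch were the cause, one common way to limit it is to hold the mutex only for an O(1) exchange of the two vectors and to do the slow work (such as saving to disc) outside the lock. The following is a minimal sketch of that idea; the class name DoubleBuffer and the int item type are placeholders, not the asker's actual implementation.

```cpp
#include <mutex>
#include <vector>

// Hypothetical double buffer: the collecting thread appends to the write
// buffer, the saving thread takes the filled buffer by swapping vectors.
class DoubleBuffer {
public:
    // Called by the collecting thread; the lock is held only for one append.
    void push(int item) {
        std::lock_guard<std::mutex> lock(mutex_);
        write_buffer_.push_back(item);
    }

    // Called by the saving thread. Swapping two vectors only exchanges their
    // internal pointers, so the lock is held for a constant, very short time.
    // The caller then processes `out` without holding any lock.
    void take(std::vector<int>& out) {
        out.clear();  // keeps capacity, so steady-state use avoids reallocation
        std::lock_guard<std::mutex> lock(mutex_);
        write_buffer_.swap(out);
    }

private:
    std::mutex mutex_;
    std::vector<int> write_buffer_;
};
```

With this layout the collecting thread can only ever wait for the duration of a swap or a single append, never for a disc write.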

Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (in which case Intel VTune is the natural choice). In my experience VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool. You don't need VS installed on your test machine; you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\. AFAIR it is available from VS2010.
Good luck

After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command sent via the virtual serial port pair sometimes gets lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data run into their 100 ms timeout.

Multithreading a File Map into an Array of Buffers

I'm trying to work with nasty large xml and text documents: ~40GBs.
I'm using Visual Studio 2012 on Windows 7.
I'm going to use 'Xerces' to snag the header/'footer tag' from the xmls.
I want to map an area of the file, say.. 60-120MBs.
Split the Map into (3 * processors/cores) equal parts. Setting each part as a buffer and loading the buffers into an array.
Then, using (#processors/cores) while statements in new threads, I will synchronously count characters/lines/xml cycles while chewing through the buffer array. When one buffer is completed, the process will jump to the next 'available' buffer and the completed buffer will be dropped out of memory. At the end I will add the total results into a project log.
Afterwards, I will reference the log, Split the files by character count/size(Or other option) to the nearest line or cycle and drop in the header and 'footer tag' to all the splits.
I'm doing this so I can import massive data to a MySQL server over a network with multiple computers.
My Question is, how do I create the buffer array and the file map with new threads?
Can I use :
win CreateFile
win CreateFileMapping
win MapViewOfFile
with standard ifstream operations and char buffers or should I opt something else?
Further clarification:
My thinking is that if I can have the hard drive streaming the file into memory from one place and in one direction, I can use the full processing power of the machine to chew through separate but equal buffers.
~Flavor: It's kind of like being a shepherd trying to scoop food out from one huge bin with 3-6 large buckets with only two arms for X sheep that need to stay inside the fenced area. But they all move at the speed of light.
A few ideas or pointers might help me along here.
Any thoughts are Most Welcome. Thanks.
while(getline(my_file, myStr))
{
    characterCount += myStr.length();
    lineCount++;
    if(my_file.eof()){
        break;
    }
}
This was the only code at run time for the test. It took 2 hours, 30+ min, at 45-50% total processor usage for the program running it on a dual core 1.6GHz laptop with 2GB RAM. Most of the RAM loaded right now is 600+MB from ~50 tabs open in Firefox, Visual Studio at 60MB, and so on.
IMPORTANT: During the test, the program running the code, which is only a window and a dialog box, seemed to dump its own working and private set of RAM, down to around 300K, and didn't respond for the length of the test. I need to make another thread for the while statement, I'm sure. But this means that NONE of the file was read into a buffer. The CPU was struggling for the entire run to keep up with the tiniest effort from the hard drive.
P.S. Further proof of CPU bottlenecking: it might take me 20min to transfer that entire file to another computer over my wireless network, which includes the read process and a socket catch to write process on the other computer.
UPDATE
I used this adorable little thing to go from the previous test time to about 15-20min which is in line with what Mats Petersson was saying.
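// read() sets failbit when it hits EOF before filling the buffer, but
// gcount() still reports the bytes it got, so the last partial block
// must be processed too.
while (my_file.read(&bufferOne[0], bufferOne.size()) || my_file.gcount() > 0)
{
    std::streamsize cc = my_file.gcount();
    for (std::streamsize i = 0; i < cc; i++)
    {
        if (bufferOne[i] == '\n')
            lineCount++;
        characterCount++;
    }
    currentPercent = characterCount/onePercent;
    SendMessage(GetDlgItem(hDlg, IDC_GENPROGRESS), PBM_SETPOS, currentPercent, 0);
}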
Granted this is a single loop and it actually behaved much more appropriately than the previous test. This test was ~800% faster than the tight loop shown above this one with Getline. I set the buffer for this loop at 20MB. I jacked this code from: SOF - Fastest Example
BUT...
I would like to point out that while polling the process in Resource Monitor and Task Manager, it clearly showed the first core at 75-90% usage, the second fluctuating between 25-50% (pretty standard for some minor background stuff that I have open), and the hard drive at.. wait for it... 50%. There were some 100% disk time spikes but also some lows at 25%. All of which basically means that splitting the buffer processing between two different threads could very well be a benefit. It will use all the system resources but.. that's what I want. I'll update later today when I have the working prototype.
MAJOR UPDATE:
Finally finished my project after a bunch of learning. No File Map needed. Only a bunch of vector char's. I have successfully built a dynamically executing file stream line and character counter.
The good news, went from the previous 10-15min marker to ~3-4min on a 5.8GB file, BOOYA!~

Very short answer: Yes, you can use those functions.
For reading data, it's likely the most efficient method to map the file content into memory, since it saves having to copy the memory into a buffer in the application, just read it straight into the place it's supposed to go. So, no problem as long as you have enough address space available - 64-bit machines should certainly have plenty, in a 32-bit system it may be more of a scarce resource - but for sections of a few hundred MB, it shouldn't be a huge issue.
However, as for using multiple threads, I'm not at all convinced. I have a fair idea that reading more than one part of a very large file at the same time will be counterproductive. This will increase the amount of head movement on the disk, which accounts for a large portion of the transfer time. You can count on some 50-100MB/s transfer rates for "ordinary" systems. If the system has some sort of RAID controller or some such, maybe around double that; very exotic RAID controllers may achieve three times.
So reading 40GB will take somewhere in the order of 3-15 minutes.
The CPU is probably not going to be very busy, and running multiple threads is quite likely to worsen the overall performance of the system.
You may want to keep a thread for reading and one for writing, and only actually write out the data once you have a sufficient amount of it, again, to avoid unnecessary moves of the read/write head on the disk(s).

Replaying stored data at a fixed rate

I am working on a problem where I want to replay data stored in a file at a specified rate.
For example: 25,000 records/second.
The file is in ASCII format. Currently, I read each line of the file and apply a regex to extract the data. Two to four lines make up a record. I timed this operation, and it takes close to 15 microseconds to generate each record.
The time taken to publish each record is 6 microseconds.
If I perform the reading and writing sequentially, then I end up with 21 microseconds to publish each record. So effectively, this means my upper bound is ~47K records per second.
If I decide to multithread the reading and writing, then I will be able to send out a packet every 9 microseconds (neglecting the locking penalty, since reader and writer share the same queue), which gives a throughput of 110K ticks per second.
Is the design above correct?
What kind of queue and locking construct has the minimum penalty when a single producer and consumer share a queue?
If I would like to scale beyond this, what's the best approach?
My application is in C++

If it takes 15 uS to read/prepare a record, then your maximum throughput will be about 1 sec / 15 uS = 67k/sec. You can ignore the 6 uS part, as the single thread reading the file cannot generate more records than that (try it: change the program to only read/process and discard the output). I'm not sure how you got 9 uS.
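The arithmetic behind both bounds can be checked with a few lines of code. This is only a sketch of the reasoning above, using made-up function names: a single thread pays for both stages per record, while two threads running in parallel are limited by the slower stage.

```cpp
#include <algorithm>

// Records per second when reading and publishing run back to back
// on a single thread: each record costs the sum of both stages.
double sequential_rate(double read_us, double publish_us) {
    return 1e6 / (read_us + publish_us);
}

// Records per second when each stage runs on its own thread:
// the slower stage sets the pace (queueing overhead is ignored).
double pipelined_rate(double read_us, double publish_us) {
    return 1e6 / std::max(read_us, publish_us);
}
```

With the figures from the question, sequential_rate(15, 6) gives roughly 47.6k records per second and pipelined_rate(15, 6) gives roughly 66.7k, which matches the 67k/sec bound stated above.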
To make this fly beyond 67k/sec ...
A) Estimate the maximum records per second you can read from the disk to be formatted. While this depends a lot on hardware, a figure of 20 MB/sec is typical for an average laptop. This number will give you the upper bound to aim for, and as you get close you can ease off trying.
B) Create a single thread just to read the file and incur the IO delay. This thread should write to large preallocated buffers, say 4 MB each. See http://en.wikipedia.org/wiki/Circular_buffer for a way of managing these. You are looking to hold maybe 1000 records per buffer (a guess, but not just 8-ish records!). Pseudocode:
While not EOF
    Allocate big buffer
    While not EOF and not buffer full
        Read file using fgets() or whatever
        Apply only very small preprocessing, ideally none
        Save into buffer
    Release buffer for other threads
C) Create another thread (or several, if the order of records is not important) to process a ring buffer when it is full; this is your regex step. This thread in turn writes to another set of output ring buffers (tip: keep the ring buffer control structures apart in memory).
While run-program
    Wait/get an input buffer to process, semaphores/mutex/whatever you prefer
    Allocate output buffer
    Process records from input buffer,
    Place result in output buffer
    Release output buffer for next thread
    Release input buffer for reading thread
D) Create your final thread to consume the data. It isn't clear if this output is being written to disk or network, so this might affect the disk reading thread.
Wait/get input buffer from processed records pool
Output records to wherever
Return buffer to processed records pool
Notes.
Preallocate all buffers and pass them back to where they came from. E.g. you might have 4 buffers between the file reading thread and the processing threads; when all 4 are in use, the file reader waits for one to be free, it doesn't just allocate new buffers.
Try not to memset() buffers if you can avoid it; it is a waste of memory bandwidth.
You won't need many buffers, maybe 6 per ring buffer.
The system will auto-tune to the slowest thread ( http://en.wikipedia.org/wiki/Theory_of_constraints ), so if you can read and prepare data faster than you want to output it, all the buffers will fill up and everything will pause except the output.
As the threads pass reasonable amounts of data at each sync point, the overhead of this will not matter too much.
The above design is how some of my code reads CSV files as quickly as possible; basically it all comes down to input IO bandwidth as the limiting factor.
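As for the queue and locking construct between two stages, a bounded queue guarded by one mutex and two condition variables is a simple starting point. The sketch below is an assumption-laden illustration rather than the code referred to above: the names BoundedQueue and run_pipeline are invented, and plain ints stand in for the large preallocated buffers. The fixed capacity gives the back-pressure described in the notes: when the queue is full, the producer waits instead of allocating more.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full, so a fast producer is throttled.
    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(std::move(value));
        not_empty_.notify_one();
    }

    // Blocks until an item is available. Returns false once the queue
    // has been closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return true;
    }

    // Signals that no more items will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Hypothetical demo: one producer, one consumer; returns the sum of 1..count.
long long run_pipeline(int count) {
    BoundedQueue<int> queue(4);  // small capacity to force back-pressure
    long long sum = 0;
    std::thread consumer([&] {
        int value;
        while (queue.pop(value)) sum += value;
    });
    for (int i = 1; i <= count; ++i) queue.push(i);
    queue.close();
    consumer.join();
    return sum;
}
```

Because each queue item would be a whole buffer of records rather than a single record, the lock is taken only once per buffer, which keeps the synchronisation cost small relative to the work.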

Loading large multi-sample audio files into memory for playback - how to avoid temporary freezing

I am writing an application that needs to use large audio multi-samples, usually around 50 MB in size. One file contains approximately 80 individual short sound recordings, which can get played back by my application at any time. For this reason all the audio data gets loaded into memory for quick access.
However, when loading one of these files, it can take many seconds to put into memory, meaning my program is temporarily frozen. What is a good way to avoid this happening? It must be compatible with Windows and OS X. It freezes at this : myMultiSampleClass->open(); which has to do a lot of dynamic memory allocation and reading from the file using ifstream.
I have thought of two possible options:
Open the file and load it into memory in another thread so my application process does not freeze. I have looked into the Boost library to do this but need to do quite a lot of reading before I am ready to implement. All I would need to do is call the open() function in the thread then destroy the thread afterwards.
Come up with a scheme to make sure I don't load the entire file into memory at any one time, I just load on the fly so to speak. The problem is any sample could be triggered at any time. I know some other software has this kind of system in place but I'm not sure how it works. It depends a lot on individual computer specifications, it could work great on my computer but someone with a slow HDD/Memory could get very bad results. One idea I had was to load x samples of each audio recording into memory, then if I need to play, begin playback of the samples that already exist whilst loading the rest of the audio into memory.
Any ideas or criticisms? Thanks in advance :-)

Use a memory mapped file. Loading time is initially "instant", and the overhead of I/O will be spread over time.

I like solution 1 as a first attempt -- simple & to the point.
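Solution 1 does not strictly need Boost: standard C++11 provides std::async, which runs a function on another thread and hands back a std::future. The following is a minimal sketch under that assumption; load_file stands in for the asker's open() call, and all three function names are invented for illustration.

```cpp
#include <chrono>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical loader: reads the whole file into memory (the slow part).
std::vector<char> load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>{in},
                             std::istreambuf_iterator<char>{});
}

// Starts the load on a worker thread and returns immediately,
// so the calling (UI) thread does not freeze.
std::future<std::vector<char>> load_file_async(std::string path) {
    return std::async(std::launch::async, load_file, std::move(path));
}

// Non-blocking check that the main loop can call periodically.
bool is_ready(const std::future<std::vector<char>>& pending) {
    return pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

The main loop would keep the future around, poll is_ready() while it keeps the interface responsive, and call get() once the data is available. Both the standard library and the thread cleanup are portable across Windows and OS X.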
If you are under Windows, you can do asynchronous file operations -- what they call OVERLAPPED -- to tell the OS to load a file & let you know when it's ready.

I think the best solution is to load a small chunk or a single sample of wave data at a time during playback, using asynchronous I/O (as John Dibling mentioned), into a fixed-size playback buffer.
The strategy is to fill the playback buffer first and then play (this adds a small delay but guarantees continuous playback). While one buffer is playing, you can refill another playback buffer on a different thread (overlapped). You need at least two playback buffers, one for playing and one for refilling in the background, and you switch between them in real time.
Later you can set the playback buffer size based on the client PC's performance (it is a trade-off between memory size and processing power; a faster CPU will need a smaller buffer and therefore have a lower delay).

You might want to consider a producer-consumer approach. This basically involves reading the sound data into a buffer using one thread, and streaming the data from the buffer to your sound card using another thread.
The data reader is the producer, and streaming the data to the sound card is the consumer. You need high-water and low-water marks so that, if the buffer gets full, the producer stops reading, and if the buffer gets low, the producer starts reading again.
A C++ Producer-Consumer Concurrency Template Library
http://www.bayimage.com/code/pcpaper.html
EDIT: I should add that this sort of thing is tricky. If you are building a sample player, the load on the system varies continuously as a function of which keys are being played, how many sounds are playing at once, how long the duration of each sound is, whether the sustain pedal is being pressed, and other factors such as hard disk speed and buffering, and amount of processor horsepower available. Some programming optimizations that you eventually employ will not be obvious at first glance.
