Using pthreads with CUDA - design questions - c++

I am writing some code that does some disk I/O, invokes a library I wrote that does some computation and GPU work, and then does more disk I/O to write the results back to a file.
I would like to make this multi-threaded, because the files are quite large. I want to be able to read in a portion of the file, send it to the GPU library, and write a portion back out to the file. The disk I/O involved is quite large (around 10 GB), and the computation is fairly quick on the GPU.
My question is more of a design question. Should I use separate threads to pre-load data for the GPU library, have only the main thread actually execute the calls to the GPU library, and then hand the resulting data off to other threads to be written back out to disk? Or should each thread do its own complete part: grab a chunk of data, execute on the GPU, write it to disk, and then go get the next chunk?
I am using CUDA for my GPU library. Is CUDA smart enough not to try to run two kernels on the GPU at once? I guess I will have to do the management manually to ensure that two threads don't try to put more data on the GPU than it has space for?
Any good resources on using multithreading and CUDA together would be appreciated.

Threads will not help with disk I/O. People generally try to solve blocking problems by creating tons of threads; in fact, that only makes things worse. What you have to do is use asynchronous I/O and not block on writes (or reads). You can use generic solutions like libevent or Asio for this, or work with the lower-level APIs available on your platform. On Linux, AIO seems to be the best option for files, but I haven't tried it yet. Hope it helps.
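For illustration, here is a minimal sketch of the POSIX AIO interface mentioned above. The file name and buffer size are placeholders, error handling is omitted, and on Linux you link with -lrt:

    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    int main() {
        int fd = open("input.bin", O_RDONLY);   // hypothetical file
        static char buf[1 << 20];

        aiocb cb;                               // control block for one request
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        aio_read(&cb);                          // returns immediately

        // ... do other work here instead of blocking on the read ...

        while (aio_error(&cb) == EINPROGRESS) { /* poll, or do more work */ }
        ssize_t got = aio_return(&cb);          // bytes read, or -1 on error
        close(fd);
        return got > 0 ? 0 : 1;
    }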

I encountered this situation with large files in my research work.
As far as I remember, there is not much gain in threading the disk I/O, because it is very slow compared to the GPU I/O.
The strategy I used was to read synchronously from disk and to load data and execute asynchronously on the GPU.
Something like:
    read from disk
    loop:
        async_load_to_gpu
        async_execute
        push_event
        read from disk
        check event complete, or read more data from disk
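As a rough illustration of that loop, here is a sketch using CUDA streams and events with double-buffered pinned host memory. gpu_library_execute() is a hypothetical stand-in for the poster's GPU library, assumed to enqueue its kernels on the given stream and return immediately; error checking and the result write-back are omitted:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical stand-in for the GPU library.
    void gpu_library_execute(float* d_data, size_t n, cudaStream_t stream);

    void pipeline(FILE* in, size_t chunk) {
        float* h_buf[2];  // double-buffered pinned host memory
        float* d_buf;
        cudaMallocHost((void**)&h_buf[0], chunk * sizeof(float));
        cudaMallocHost((void**)&h_buf[1], chunk * sizeof(float));
        cudaMalloc((void**)&d_buf, chunk * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaEvent_t done;
        cudaEventCreate(&done);

        int cur = 0;
        size_t n = fread(h_buf[cur], sizeof(float), chunk, in);   // read from disk
        while (n > 0) {
            cudaMemcpyAsync(d_buf, h_buf[cur], n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);      // async_load_to_gpu
            gpu_library_execute(d_buf, n, stream);                // async_execute
            cudaEventRecord(done, stream);                        // push_event
            cur ^= 1;  // read the next chunk while the GPU works on this one
            n = fread(h_buf[cur], sizeof(float), chunk, in);      // read from disk
            cudaEventSynchronize(done);  // wait for the GPU to finish this chunk
            // (copy the results back and write them out here, omitted for brevity)
        }

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf[0]);
        cudaFreeHost(h_buf[1]);
    }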

Related

How to achieve more efficient file writing in C++? Threads, buffers, memory mapped files?

I'm working on a new project (a game engine for self education) and trying to create a logging system. I want the logger to help with debugging as much as possible, so I plan on using it a lot to write to a log file. The only issue is that I'm worried doing file I/O will slow down the game loop which needs to operate within a time bound. What is the best way I can write to a file with minimal risk of slowing down the important section?
I have thought about using threads, but I'm worried that the overhead of context switches due to the process scheduler may be even more of an impediment to performance.
I have considered writing to a buffer and occasionally doing a large dump to the file, but I have read that this can potentially be even slower than regular file writing if the buffer becomes too big. Is it feasible to keep the whole buffer in memory and only write all the contents to the file at once at the end of the program?
I have read lightly about using a memory mapped file, but I've also read that it requires the boost library to be done effectively. I'd like to minimize the dependencies, so ideally I wouldn't use boost. I'm also not entirely sure that my concept of memory mapped files is correct. From what I understand, it behaves as if you are simply writing to memory, but eventually the memory contents will be written to the file. Is this conception correct?
Thanks for reading all of this :)
TL;DR - How can I implement a logging system that minimizes the performance decrease of my program?
If you decide to keep everything in memory and write the whole log to the file at the end, then any application crash will wipe away all the debug data.
About the memory mapped file: you are right. But you have to consider when the in-memory pages will actually be written to the disk.
You can also use IPC methods: separate the logger process from the main process and have the two processes communicate via a queue. The main process puts messages in the queue, and the logger process takes them out and writes them to the file.
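The same producer/consumer idea also works in-process. A minimal sketch (assuming C++11; the class and member names are made up) of a background thread draining a mutex-protected queue, so the game loop only ever pays for an enqueue:

    #include <condition_variable>
    #include <deque>
    #include <fstream>
    #include <mutex>
    #include <string>
    #include <thread>

    class AsyncLogger {
    public:
        explicit AsyncLogger(const std::string& path)
            : out_(path), worker_(&AsyncLogger::drain, this) {}

        ~AsyncLogger() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();               // flush remaining messages on shutdown
        }

        void log(std::string msg) {       // cheap: enqueue and wake the worker
            { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(msg)); }
            cv_.notify_one();
        }

    private:
        void drain() {                    // runs on the background thread
            std::unique_lock<std::mutex> lk(m_);
            while (!done_ || !q_.empty()) {
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                while (!q_.empty()) {
                    std::string msg = std::move(q_.front());
                    q_.pop_front();
                    lk.unlock();          // do the slow write outside the lock
                    out_ << msg << '\n';
                    lk.lock();
                }
            }
        }

        std::ofstream out_;
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<std::string> q_;
        bool done_ = false;
        std::thread worker_;              // must be declared (started) last
    };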

Reading many files in parallel

I have a cross-platform project in C++ where I am mixing audio in real time. I have several independent tracks as input that I read from separate files on disk. I then mix these, apply some processing, and spit out a buffer with the resulting audio. The problem I am having is disk I/O speed. For the current test I am performing, I have about 10 tracks that are read simultaneously from disk. Each track is in raw PCM, 48000 Hz, 16-bit stereo, which means there is a significant amount of data that needs to be read as quickly as possible.

I have tried both simple fread calls and memory mapped files through Boost, but the issue is the same: when a file is first opened, it usually causes the audio to break up (presumably while the file is read into cache by the OS). After that, everything runs smoothly without a glitch. For the time being I use one thread per file in the common case, sometimes two files per thread; it is usually when I have two files per thread that the stalling/breakup of the stream occurs. Note that I do not know in advance which input files need to be played, as this is controlled by the user.

So my problem is how to read these initial blocks in such a way that I don't get stalling/breakup. Also, when a new file is loaded, it is not necessarily at the beginning of the file that reading must start.
I have a few thoughts:
Can we prefetch the files into cache by reading all of them once at startup and disregarding the data? I cannot store all of it in memory. But it seems bad to rely on the internal behavior of the OS's read cache, especially since this is cross-platform.
Can we use a format such as Ogg Vorbis for compression, load the compressed data fully into memory and then decode on the fly? I am thinking that decoding 10 or more Vorbis streams might be too CPU intensive, but I have no benchmarks yet. At least in this way we turn it from an I/O bound task to a CPU bound one.
Can we do any other kind of clever buffering approach to make it so that the large reads are more equally distributed? I know very little about how I might accomplish that.
I am stuck at this point, and would appreciate any suggestions that might improve throughput.
Try doing the file loading using event processing.
This is where you open a bunch of file descriptors and let the operating system notify your program when data is available.
The most broadly available API for this is select (http://linux.die.net/man/2/select), but there are better methods (poll, epoll, kqueue); these are not available everywhere.
There are libraries that abstract this for you (libev and libevent).
So the way you do it is: one thread opens all the files you need and sets a 'watcher' on them. When data is available, the watcher triggers and calls a callback.
The advantage is that you don't have a ton of threads waiting and sleeping while checking all your open file descriptors. If that doesn't work, then you are likely oversaturating the hardware's I/O bandwidth, in which case you just have to wait; if that is the case, you need to do some buffering to avoid stutters.
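A bare-bones sketch of that watcher loop using select (the descriptor list and buffer size are placeholders; note that select is of most value for pipes and sockets, since ordinary disk files typically always report as readable):

    #include <sys/select.h>
    #include <unistd.h>
    #include <vector>

    // fds: already-open file descriptors to watch (placeholder list)
    void watch_and_read(const std::vector<int>& fds) {
        static char buf[65536];
        for (;;) {
            fd_set readable;
            FD_ZERO(&readable);
            int maxfd = -1;
            for (int fd : fds) {
                FD_SET(fd, &readable);
                if (fd > maxfd) maxfd = fd;
            }
            // Block until at least one descriptor has data available.
            if (select(maxfd + 1, &readable, nullptr, nullptr, nullptr) <= 0)
                break;
            for (int fd : fds) {
                if (FD_ISSET(fd, &readable)) {
                    ssize_t n = read(fd, buf, sizeof buf);
                    if (n > 0) { /* hand the buffer to the mixer (omitted) */ }
                }
            }
        }
    }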
As a rule of thumb, you need to perform file I/O in a separate thread for real-time operations. When the user wants to mix in a second audio file, you can spawn a new thread, read the first N bytes of that second file, and return the data read to the main thread. This will still cause a small lag, but it will not break the audio flow.
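A sketch of that prefetch using std::async (assuming C++11; the path, offset, and N are whatever the user's selection dictates):

    #include <fstream>
    #include <future>
    #include <string>
    #include <vector>

    // Read the first n bytes of a newly selected track on a worker thread,
    // starting at 'offset' (playback need not start at the beginning).
    std::future<std::vector<char>> prefetch(std::string path,
                                            std::streamoff offset, size_t n) {
        return std::async(std::launch::async, [path, offset, n] {
            std::ifstream f(path, std::ios::binary);
            f.seekg(offset);
            std::vector<char> buf(n);
            f.read(buf.data(), buf.size());
            buf.resize(static_cast<size_t>(f.gcount()));  // trim to bytes read
            return buf;
        });
    }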

What does "Disk Profiling" mean (related to hard disks)? [duplicate]

Currently I am working on an MFC application which reads from and writes to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that this is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1) Currently I am using the AQTime profiler to profile the application. Has anybody tried profiling disk access using this? Or is there any other tool available which I can use?
(2) What are the most important disk parameters I should be looking at?
(3) If I have multiple threads trying to read and write data from the disk, does it affect performance? I.e., am I better off having single-threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve, though. This will also let you determine which file I/Os actually result in real access to the disk and aren't handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. Commit (a.k.a. fsync) is an expensive operation, so it becomes even more so when there are lots of small writes.
On Windows file handles you can experiment with FILE_FLAG_WRITE_THROUGH to increase write speeds. Supposedly, commit doesn't have to be called on handles using this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
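To illustrate the batching advice, here is a POSIX sketch (the batch size is an arbitrary placeholder; on Windows you would use WriteFile and FlushFileBuffers in place of write and fsync):

    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    // Accumulate small writes in memory; hit the disk in large batches.
    class BatchedWriter {
    public:
        explicit BatchedWriter(const char* path)
            : fd_(open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) {}

        void write_bytes(const void* p, size_t n) {
            const char* c = static_cast<const char*>(p);
            buf_.insert(buf_.end(), c, c + n);
            if (buf_.size() >= kBatch) flush();   // one big write() per batch
        }

        void commit() {          // flush the buffer, then make it durable
            flush();
            fsync(fd_);          // expensive: call once per batch, not per write
        }

        ~BatchedWriter() { commit(); close(fd_); }

    private:
        void flush() {
            if (!buf_.empty()) { ::write(fd_, buf_.data(), buf_.size()); buf_.clear(); }
        }
        static const size_t kBatch = 1 << 20;     // 1 MiB; tune by measuring
        int fd_;
        std::vector<char> buf_;
    };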
Hopefully this helps....
What I would do, if you can't pause all threads at the same time and examine their state, is focus on one of them and pause it while it's being "damn slow". This is a little-known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.

What is the Fastest Method for High Performance Sequential File I/O in C++?

Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.
Input:
A large file (sequentially written as above) is read from disk from beginning to end.
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Some possible considerations:
Guidelines for choosing the optimal buffer size
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(My most immediate interest is Win x64, but I am interested in comments on Solaris and Linux as well.)
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
Use mapped memory instead of writing to files. This isn't always faster, but it allows the opportunity to optimize the I/O in an operating-system-specific but relatively portable way, by avoiding unnecessary copying and taking advantage of the OS's knowledge of how the disk is actually being used. ("Portable" if you use a wrapper, not an OS-specific API call.)
Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.
Some possible considerations:
Guidelines for choosing the optimal buffer size
Page size for starters, but be ready to tune from there.
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?
Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly fine tuned for the particular platform it is on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
Some general advice is to turn off buffering and read/write in large chunks (but not too large, or you will waste too much time waiting for the whole I/O to complete when you could otherwise already be munching away at the first megabyte). It's trivial to find the sweet spot with this algorithm; there's only one knob to turn: the chunk size.
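A sketch of turning that one knob (POSIX read; the file path is hypothetical, and you should drop the OS page cache between runs or the numbers will be skewed by caching):

    #include <fcntl.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time one full sequential read of 'path' using the given chunk size.
    double time_read(const char* path, size_t chunk) {
        int fd = open(path, O_RDONLY);
        std::vector<char> buf(chunk);
        auto t0 = std::chrono::steady_clock::now();
        while (read(fd, buf.data(), chunk) > 0) {}
        auto t1 = std::chrono::steady_clock::now();
        close(fd);
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        for (size_t chunk = 4096; chunk <= (32u << 20); chunk *= 2)
            std::printf("%8zu KiB: %.3f s\n", chunk / 1024,
                        time_read("testfile.bin", chunk));  // hypothetical file
    }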
Beyond that, for input mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it, to tell the kernel how you will traverse the file, so it can do readahead and throw out the pages afterwards again quickly.
For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.
If mmap() is not to your liking, then there's fadvise(), and, for the really tough ones, async file I/O.
(All of the above is POSIX, Windows names may be different).
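A sketch of that input path with mmap() and madvise() (POSIX; error handling is omitted, and the byte-summing loop is just a stand-in for real processing):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    unsigned long checksum_file(const char* path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        // Map the whole file shared and read-only.
        auto* p = static_cast<unsigned char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));
        // Tell the kernel we will traverse sequentially, so it can read ahead
        // aggressively and throw out the pages behind us quickly.
        madvise(p, st.st_size, MADV_SEQUENTIAL);
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; ++i) sum += p[i];  // stand-in workload
        munmap(p, st.st_size);
        close(fd);
        return sum;
    }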
For Windows, you'll want to make sure you use the FILE_FLAG_SEQUENTIAL_SCAN in your CreateFile() call, if you opt to use the platform specific Windows API call. This will optimize caching for the I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised. 8K is a nice starting point with little to be gained from going larger.
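For illustration, a minimal sketch of that flag in use (Win32; error handling omitted, with the 8K buffer suggested above):

    #include <windows.h>

    void read_sequentially(const wchar_t* path) {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN,  // hint the cache manager
                               nullptr);
        char buf[8 * 1024];                                // 8K starting point
        DWORD got = 0;
        while (ReadFile(h, buf, sizeof buf, &got, nullptr) && got > 0) {
            // process buf[0..got) here
        }
        CloseHandle(h);
    }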
This article discusses the comparison between async and sync on Windows.
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.
A general guideline, though, would be to write in as large chunks as possible. Typically, writing a byte at a time is the slowest.
The best way to know for sure is to code a few different ways and profile them.
You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.
On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it onto the disk. Without an explicitly blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.
You need to be careful if the files are not on local devices but a network drive. Network errors will show up as SEH errors, which you will need to be prepared to handle.
On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.
Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.
You will get the absolute fastest performance by using CreateFile and ReadFile. Open the file with FILE_FLAG_SEQUENTIAL_SCAN.
Read with a buffer size that is a power of two. Only benchmarking can determine this number. I have seen it to be 8K once. Another time I found it to be 8M! This varies wildly.
It depends on the size of the CPU cache, on the efficiency of OS read-ahead and on the overhead associated with doing many small writes.
Memory mapping is not the fastest way. It has more overhead because you can't control the block size and the OS needs to fault in all pages.
On Linux, buffered reads and writes speed things up a lot, increasingly with increasing buffer sizes, but the returns are diminishing, and you generally want to use BUFSIZ (defined by stdio.h), as larger buffer sizes won't help much.
mmap()ing provides the fastest access to files, but the mmap call itself is rather expensive. For small files (16 KiB), read and write system calls win (see https://stackoverflow.com/a/39196499/1084774 for numbers on reading through read and mmap).
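A sketch of the stdio-buffered loop described above (BUFSIZ comes from stdio.h; the larger setvbuf buffer is optional experimentation):

    #include <cstdio>

    long count_bytes(const char* path) {
        FILE* f = std::fopen(path, "rb");
        // stdio gives you a BUFSIZ-sized buffer by default; setvbuf (called
        // before any I/O on the stream) lets you experiment with larger ones,
        // though the returns diminish quickly.
        static char big[1 << 16];
        std::setvbuf(f, big, _IOFBF, sizeof big);
        long total = 0;
        char chunk[4096];
        size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0) total += n;
        std::fclose(f);
        return total;
    }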
