How to design a real-time CUDA financial application - C++

I have a general question about how to design my application. I have read the CUDA documentation, but I still don't know what I should look into. I would really appreciate it if someone could shed some light on this.
I want to do some real-time analytics on stocks, say 100 stocks, and I have a real-time market data feed that streams updated market prices. What I want to do is:
Pre-allocate a memory block for each stock on the CUDA card, and keep that memory allocated for the whole trading day.
When new data comes in, directly update the corresponding memory on the CUDA card.
After updating, issue a signal or trigger an event to start the analytical calculation.
When the calculation is done, write the result back to CPU memory.
Here are my questions:
(1) What is the most efficient way to stream data from CPU memory to GPU memory? Because I want this in real time, copying a memory snapshot from CPU to GPU every second is not acceptable.
(2) I may need to allocate memory blocks for the 100 stocks on both the CPU and the GPU. How do I map each CPU memory cell to the corresponding GPU memory cell?
(3) How do I trigger the analytics calculation when new data arrives on the CUDA card?
I am using a Tesla C1060 with CUDA 3.2 on Windows XP.
Thank you very much for any suggestion.

There is nothing unusual in your requirements.
You can keep information in GPU memory as long as your application is running, and do small updates to keep the data in sync with what you have on the CPU. You can allocate your GPU memory with cudaMalloc() and use cudaMemcpy() to write updated data into sections of the allocated memory. Or, you can hold data in a Thrust structure, such as a thrust::device_vector. When you update the device_vector, CUDA memory copies are done in the background.
After you have updated the data, you simply rerun your kernel(s) to get updated results for your calculation.
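Here is a minimal sketch of that flow in CUDA C++ (the Tick struct, the analyzeStocks kernel, and the placeholder analytic are made up for illustration, not part of any library): allocate the per-stock buffer once, overwrite a single slot with cudaMemcpy when a tick arrives, then relaunch the kernel and copy the results back.

#include <cuda_runtime.h>
#include <cstdio>

struct Tick { double price; double volume; long long timestamp; };

__global__ void analyzeStocks(const Tick* ticks, double* results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = ticks[i].price * ticks[i].volume;   // placeholder analytic
}

int main()
{
    const int kNumStocks = 100;
    Tick*   dTicks   = NULL;
    double* dResults = NULL;
    double  hResults[kNumStocks];

    // Allocate once at startup and keep the buffers for the whole trading day.
    cudaMalloc((void**)&dTicks,   kNumStocks * sizeof(Tick));
    cudaMalloc((void**)&dResults, kNumStocks * sizeof(double));

    // On each market update: copy only the slot of the stock that changed.
    int stockId = 42;                                    // example index
    Tick update = {101.25, 500.0, 1234567890LL};
    cudaMemcpy(dTicks + stockId, &update, sizeof(Tick), cudaMemcpyHostToDevice);

    // Re-run the analytics kernel over all stocks and read back the results.
    analyzeStocks<<<(kNumStocks + 127) / 128, 128>>>(dTicks, dResults, kNumStocks);
    cudaMemcpy(hResults, dResults, sizeof(hResults), cudaMemcpyDeviceToHost);
    printf("stock %d result: %f\n", stockId, hResults[stockId]);

    cudaFree(dTicks);
    cudaFree(dResults);
    return 0;
}

For lower-latency updates you could refine this with pinned host memory and cudaMemcpyAsync, but for 100 stocks a plain cudaMemcpy of a single slot is already very cheap.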
Could you expand on question (2)?

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In an NVIDIA developer blog post, An Even Easier Introduction to CUDA, the writer explains:
To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device (GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned that there is a distinct performance difference between passing VRAM-based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, which is highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, which takes a performance hit, since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated to (i.e., moved to, and given a physical allocation on) whichever processor attempts to access it.
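As a minimal sketch of that behavior (the kernel and sizes are arbitrary; only standard runtime API calls are used), the same pointer is touched from host code and device code, and the runtime migrates the pages on demand:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // device touches the pages -> they migrate to VRAM
}

int main()
{
    const int n = 1 << 20;
    float* data = NULL;
    cudaMallocManaged((void**)&data, n * sizeof(float));   // single pointer, no explicit copies

    for (int i = 0; i < n; ++i) data[i] = 1.0f;            // host touch -> pages resident in RAM

    scale<<<(n + 255) / 256, 256>>>(data, n);              // device access -> pages migrate
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                     // host touch again -> pages migrate back
    cudaFree(data);
    return 0;
}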
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No, you don't have it all wrong. Yes, we are talking about VRAM.
The blog post you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it, and you might want to review some of them; here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.

How to calculate the largest possible float array I can copy to GPU?

I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated through cudaMalloc()), but I get a device out-of-memory error long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
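A rough sketch of how such a check might look in practice (the 90% headroom factor is an arbitrary assumption, not a documented rule): query the free/total figures with cudaMemGetInfo and only admit a job whose estimated footprint fits with some margin to spare.

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

bool canFitJob(std::size_t jobBytes)
{
    std::size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess)
        return false;

    // Leave headroom for fragmentation, context/heap overhead, display use, etc.
    std::size_t usable = static_cast<std::size_t>(freeBytes * 0.9);
    printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);
    return jobBytes <= usable;
}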

Fast memory allocation for real time data acquisition

I have a range of sensors connected to a PC that measure various physical parameters, like force, rotational speed and temperature. These sensors continuously produce samples at some sample rate. A sample consists of a timestamp and the measured dimension itself; the sample rates are in magnitudes of single-digit kilohertz (i.e., somewhere between 1 and 9000 samples per second).
The PC is supposed to read and store these samples during a given period of time. Afterwards the collected data is further treated and evaluated.
What would be a sensible way to buffer the samples? In a realistic setup the acquisition could easily gather a couple of megabytes per second. Paging could also become critical if memory is allocated quickly but has to be swapped in when it is first written.
I could imagine a threaded approach where a separate thread allocates and manages a pool of (locked, and therefore non-swappable) memory chunks. As long as enough of these chunks are pre-allocated, further allocation would only block this pool thread (for example, when other processes' pages have to be swapped out first), and the acquisition could proceed without interruption.
This basically is a conceptual question. Yet, to be more specific:
It should only rely on portable features, like POSIX. Features out of Qt's universe are fine, too.
The sensors can be interfaced in various ways. IP is one possibility, but usually the sensors are directly connected to the PC via local links (RS232, USB, extension cards and such), which are fast enough.
The timestamps are mostly applied by the acquisition hardware itself, if it is capable of doing so, to avoid jitter over the network, etc.
Thinking it over
Should I really worry? Apparently the problem splits into three scenarios:
There is only little data collected at all. It can easily be buffered in one large pre-allocated buffer.
Data is collected slowly. Allocating the buffers on the fly is perfectly fine.
There is so much data acquired at high sample rates. Then allocation is not the problem because the buffer will eventually overflow anyway. The problem is rather how to transfer the data from the memory buffer to permanent storage fast enough.
The idea for solving this type of problem can be as follows:
Separate the problem into two or more processes, depending on what you need to do with your data:
Acquirer
Analyzer (if you want to process data in real time)
Writer
Store the data in a circular buffer in shared memory (I recommend using boost::interprocess).
The Acquirer continuously reads data from the device and stores it in shared memory. Once enough data has been read for any analysis, the Analyzer starts processing it; it can store results into another circular buffer in shared memory if needed. Meanwhile, the Writer reads the data from shared memory (raw or already processed) and stores it in the output file.
You need to make sure all the processes are synchronized properly so that they do their jobs simultaneously and you don't lose data (i.e., data is not overwritten before it is processed or saved to the output file).
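A minimal sketch of such a shared-memory ring buffer with boost::interprocess (single producer, single consumer; the names acq_shm and sample_ring, the Sample layout, and the capacity are all assumptions for illustration):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

struct Sample { double timestamp; double value; };

// Fixed-capacity ring buffer that lives entirely inside the shared segment.
struct SampleRing {
    static const std::size_t kCapacity = 4096;
    bip::interprocess_mutex mutex;
    bip::interprocess_condition not_empty;
    std::size_t head, tail, count;
    Sample data[kCapacity];

    SampleRing() : head(0), tail(0), count(0) {}

    void push(const Sample& s) {                  // called by the Acquirer
        bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
        data[tail] = s;
        tail = (tail + 1) % kCapacity;
        if (count < kCapacity) ++count;
        else head = (head + 1) % kCapacity;       // full: overwrite the oldest sample
        not_empty.notify_one();
    }

    Sample pop() {                                // called by the Analyzer/Writer
        bip::scoped_lock<bip::interprocess_mutex> lock(mutex);
        while (count == 0) not_empty.wait(lock);
        Sample s = data[head];
        head = (head + 1) % kCapacity;
        --count;
        return s;
    }
};

int main()
{
    // Acquirer side: create the segment and construct the ring inside it.
    bip::shared_memory_object::remove("acq_shm");
    bip::managed_shared_memory segment(bip::create_only, "acq_shm", 1 << 20);
    SampleRing* ring = segment.construct<SampleRing>("sample_ring")();

    Sample s = {0.001, 42.0};
    ring->push(s);                                // one acquired sample
    return 0;
}

The Analyzer and Writer processes would open the same segment with bip::open_only and look up the ring with segment.find<SampleRing>("sample_ring").first, then call pop() in a loop.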

Check available memory in the system for new allocations

I'm working in a Windows C++ application to work with point clouds. We use the PCL library along with Qt and OpenSceneGraph. The computer has 4 GB of RAM.
If we load a lot of points (for example, 40 point clouds have around 800 million points in total) the system goes crazy.
The app becomes almost unresponsive (it takes ages to move the mouse around it, and the arrow changes to a circle that keeps spinning), and in the Task Manager, on the Performance tab, I get this output:
Memory (1 in the picture): goes up to 3.97 GB, almost the total of the system.
Free (2 in the picture): 0
I have checked these posts: here and here, and with the MEMORYSTATUSEX version I can get the memory info.
The idea here is, before loading more clouds, to check the available memory. If the "weight" of the cloud that we're going to load is bigger than the available memory, don't load it, so the app won't freeze and the user has the chance to remove older clouds to free some memory. It's worth noting that no exceptions are thrown; the worst scenario I got was that Windows killed the app itself when the memory was insufficient.
Now, is this a good idea? Is there a canonical way to deal with this thing?
I would be glad to hear your thoughts on this matter.
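For reference, the check I have in mind looks roughly like this (a sketch only; the 80% headroom factor is my own assumption, not something from the documentation):

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <cstdint>

bool cloudFitsInMemory(std::uint64_t cloudBytes)
{
    MEMORYSTATUSEX status = {};
    status.dwLength = sizeof(status);
    if (!GlobalMemoryStatusEx(&status))
        return false;

    // ullAvailPhys is free physical RAM; keep some headroom so the OS and the
    // rest of the app don't start swapping the moment the cloud is loaded.
    return cloudBytes < static_cast<std::uint64_t>(status.ullAvailPhys * 0.8);
}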
You are coming at this from a different direction than the usual approach to similar problems.
Normally, one would allocate and then attempt to lock into physical memory the space they need (mlock() in POSIX, VirtualLock() in the WinAPI). The reasoning is that even if the system has enough available physical memory at the moment, some other process could spawn the next moment and push part of your resident set into swap.
This will require you to use a custom allocator, as well as to ensure that your process has permission to lock down the required number of pages.
Read here for a start on this: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
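A hedged sketch of that approach on Windows (the working-set margins are arbitrary assumptions, and error handling is reduced to returning nullptr):

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

void* allocateLocked(SIZE_T bytes)
{
    // VirtualLock is limited by the process working-set size, so grow it first.
    SIZE_T minWs = bytes + (16 << 20);   // assumed margin for the rest of the process
    SIZE_T maxWs = minWs + (16 << 20);
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), minWs, maxWs))
        return nullptr;

    void* p = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p && !VirtualLock(p, bytes)) {   // pin the pages in physical memory
        VirtualFree(p, 0, MEM_RELEASE);
        return nullptr;
    }
    return p;
}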
You are also likely running into memory issues with your graphics card even once the points are loaded. You should probably monitor that as well: once your loaded point clouds exceed your dedicated graphics card memory (which they almost certainly do in this case), rendering slows to a crawl.
800 million is also an immense number of points. With a minimum of 3 floats per point (assuming no colorization) you are talking about 9.6 GB of point data, so you are swapping like crazy.
I generally start voxelizing to reduce memory usage once I get beyond 30-40 million points.
This is more complicated than you might imagine. The available memory shown in the system display is physical memory. The amount of memory available to your application is virtual memory.
The physical memory is shared by all processes on the computer, so if you have something else running at the same time, it is competing for the same physical memory.
I suspect that the problem you are seeing is processing. Using half the memory on a 4 GB system should be no big deal.
If you are doing lengthy calculations do you give the system a chance to process accumulated events?
That is what I suspect the real problem is.

Using Nsight to determine bank conflicts and coalescing

How can I find the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may reveal why my program is slow?
I don't use Nsight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll want to pay attention to your GPU's occupancy.
Other interesting values are where the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescing: basically you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the Uncoalesced ones.
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
As for the question of which important fields/things to look at (when using the Nsight profiler) may explain why your program is slow:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring, but your application threads (Thread State) are green.
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory row. If you are spending a lot of time in memory copies, then you should consider using CUDA streams to pipeline your application. This allows you to overlap memory transfers and kernels (see the sketch after this list). Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – if the majority of the application time is spent waiting on kernels to complete, then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernels' actual execution time faster.
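To illustrate point (b), here is a minimal sketch of the streams idea (the chunk sizes, the process kernel, and the per-chunk stream layout are arbitrary choices for illustration): pinned host memory plus cudaMemcpyAsync lets copies on one stream overlap with kernels on another.

#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // placeholder work
}

int main()
{
    const int kChunks = 4, kChunkElems = 1 << 20;
    float* hData = NULL;
    float* dData = NULL;
    cudaMallocHost((void**)&hData, kChunks * kChunkElems * sizeof(float)); // pinned host buffer
    cudaMalloc((void**)&dData, kChunks * kChunkElems * sizeof(float));

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    // Issue copy-in, kernel, copy-out per chunk on its own stream so that
    // transfers of one chunk can overlap with computation on another.
    for (int c = 0; c < kChunks; ++c) {
        size_t off = static_cast<size_t>(c) * kChunkElems;
        cudaMemcpyAsync(dData + off, hData + off, kChunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process<<<(kChunkElems + 255) / 256, 256, 0, streams[c]>>>(dData + off, kChunkElems);
        cudaMemcpyAsync(hData + off, dData + off, kChunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(dData);
    cudaFreeHost(hData);
    return 0;
}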