I was testing the new CUDA 8 along with the Pascal Titan X GPU, expecting a speed-up for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.
Here is the minimal code that reproduces the result:
CUDASample.cuh
#include <vector>

class CUDASample{
public:
void AddOneToVector(std::vector<int> &in);
};
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMallocManaged(reinterpret_cast<void **>(&data),
in.size() * sizeof(int),
cudaMemAttachGlobal);
for (std::size_t i = 0; i < in.size(); i++){
data[i] = in.at(i);
}
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
for (std::size_t i = 0; i < in.size(); i++){
in.at(i) = data[i];
}
cudaFree(data);
}
Main.cpp
std::vector<int> v;
for (int i = 0; i < 8192000; i++){
v.push_back(i);
}
CUDASample cudasample;
cudasample.AddOneToVector(v);
The only difference is the NVCC flag, which for the Pascal Titan X is:
-gencode arch=compute_61,code=sm_61 -std=c++11;
and for the old Maxwell Titan X is:
-gencode arch=compute_52,code=sm_52 -std=c++11;
EDIT: Here are the results from running the NVIDIA Visual Profiler.
For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.
For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.
I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()),
in.size() * sizeof(int), cudaMemcpyHostToDevice);
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data),
in.size() * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(data);
}
For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.
For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...
Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It would be a travesty if I have to revert all my existing code from cudaMallocManaged to cudaMalloc. This experiment also shows that the memory transfer time with cudaMallocManaged is a lot slower than with cudaMalloc, which also feels wrong. If using it results in slower run times even though the code is easier to write, that seems unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this result?
Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about CUDA 9's updated behavior for Windows.)
With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.
With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:
Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
Cause that page of memory to be migrated from the CPU to the GPU
This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).
Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.
This particular code example, however, does not benefit. This was anticipated, and so the recommended way to bring the behavior in line with the previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.
You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.
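For example, a minimal sketch applied to the code in the question could look like this (the cudaGetDevice call to obtain the destination device is my addition, and NULL selects the default stream):
int deviceId = 0;
cudaGetDevice(&deviceId);  // the device the kernel will run on
// migrate the managed allocation to the GPU before the kernel launch
cudaMemPrefetchAsync(data, in.size() * sizeof(int), deviceId, NULL);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();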
With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
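For instance, a sketch of that two-kernel experiment with the code from the question would simply be:
// first launch: pays the demand-paging cost as pages migrate to the GPU
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
// second launch: the data is already resident on the GPU, so it runs at roughly full speed
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();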
Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in its latency-hiding architecture, if there is sufficient other available "work" to attend to.
As an additional note, in CUDA 9, the demand-paging regime for Pascal and beyond is only available on Linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On Windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as for Maxwell and prior devices; data is migrated to the GPU en masse, at kernel launch.
(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.
Additional information is available in this blog article.
I can reproduce this in three programs on a 1060 and a 1080. As an example, I use a volume renderer with a procedural transfer function, which was nearly interactive real-time on a 960 but is a slide show on a 1080. All data are stored in read-only textures, and only my transfer functions are in managed memory. Unlike my other code, the volume renderer runs especially slowly; this is because, in contrast to my other code, my transfer functions are passed from the kernel to other device functions.
I believe it is not only the calling of kernels with cudaMallocManaged data. My experience suggests that every call of a kernel or device function shows this behavior, and the effect adds up. The basis of the volume renderer is also partly the provided CudaSample without managed memory, which runs as expected on Maxwell and Pascal GPUs (1080, 1060, 980 Ti, 980, 960).
I found this bug only yesterday, because we changed all of our research systems to Pascal. I will profile my software in the next few days on a 980 compared to a 1080. I'm not yet sure whether I should report a bug in the NVIDIA developer zone.
It is an NVIDIA bug on Windows systems which occurs with the Pascal architecture.
I have known this for a few days, but could not write it here because I was on vacation without an internet connection.
For details, see the comments on: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the bug. It should be corrected with CUDA 9. He also says that it should be communicated to Microsoft to help the cause, but I have not found a suitable Microsoft bug report page so far.
Related
I was trying to apply the Fast Fourier Transform to the data I collected. After the FFT operation, I wanted to calculate the modulus of the cufftComplex type data. Therefore, I summed the square of the real part and the square of the imaginary part, and then took the square root of the sum. The code is provided below, along with the assignment of the grid and block dimensions:
dim3 dimBlock(256);
dim3 dimGrid(FFTlength / 256 * lines);
__global__ void modulus_kernel(int length, int lines, cufftComplex *PostFFTData, float* z)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
if(x<length*lines)
z[x] = sqrt(PostFFTData[x].x * PostFFTData[x].x + PostFFTData[x].y * PostFFTData[x].y);
__syncthreads();
}
The length of the PostFFTData pointer array is 1024000, and the length and lines are 2048 and 500 respectively.
After executing the code, I analyzed the timeline of the program with the NVIDIA Visual Profiler.
It shows that the modulus kernel took 0.367 ms to complete. The GPU I used is a GTX 1080 and the CPU is an i7-7700U. If I want to shorten the execution time, how should I do it?
If I want to shorten the execution time, how should I do it?
I can think of at least five things (in no particular order)
Get rid of the __syncthreads() call. It is unnecessary and will actively slow down your code
Pass length*lines as a single argument to the kernel. Why have every thread do an integer multiply for a value which is constant?
Use a grid stride loop and launch only as many threads as can be resident on the device. Use the occupancy APIs to let the runtime do the hard thinking about the launch parameters for you.
If the problem size allows, use #pragma unroll with a suggested unroll factor to hint to the compiler that the grid-stride loop can be partially unrolled. If that doesn't allow the compiler to generate a stream of floating point operations, then partially unroll the grid-stride loop yourself.
Because you are passing single precision floating point values, use sqrtf, not sqrt. There are significant performance differences between double and single precision functions. If your application allows it, consider using a less accurate version of the sqrt function (-prec-sqrt=false). A sketch combining these suggestions follows this list.
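Putting these suggestions together, a sketch could look like the following (the kernel and variable names mirror the question; cudaOccupancyMaxPotentialBlockSize is one way to apply the occupancy API, and the unroll factor of 4 is only an illustrative hint):
__global__ void modulus_kernel(unsigned int n, const cufftComplex *PostFFTData, float *z)
{
    unsigned int stride = blockDim.x * gridDim.x;
    #pragma unroll 4
    for (unsigned int x = blockIdx.x * blockDim.x + threadIdx.x; x < n; x += stride)
        // single precision sqrtf, no __syncthreads(), n = length*lines computed once on the host
        z[x] = sqrtf(PostFFTData[x].x * PostFFTData[x].x + PostFFTData[x].y * PostFFTData[x].y);
}

// let the runtime suggest launch parameters that fill the device
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, modulus_kernel, 0, 0);
modulus_kernel<<<minGridSize, blockSize>>>(length * lines, PostFFTData, z);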
__syncthreads();
is useless since there is no sharing between threads
I am aware of multiple questions on this topic; however, I haven't seen any clear answers nor any benchmark measurements. I thus created a simple program that works with two arrays of integers. The first array a is very large (64 MB) and the second array b is small enough to fit into L1 cache. The program iterates over a and adds its elements to the corresponding elements of b in a modular sense (when the end of b is reached, the program starts from its beginning again). The measured numbers of L1 cache misses for different sizes of b are as follows:
The measurements were made on a Xeon E5 2680v3 Haswell-type CPU with 32 kiB of L1 data cache. Therefore, in all the cases, b fitted into L1 cache. However, the number of misses grew considerably at around a 16 kiB memory footprint of b. This might be expected, since at this point the loads of both a and b cause eviction of cache lines from the beginning of b.
There is absolutely no reason to keep elements of a in cache; they are used only once. I therefore ran a program variant with non-temporal loads of the a data, but the number of misses did not change. I also ran a variant with non-temporal prefetching of the a data, but still with the very same results.
My benchmark code is as follows (variant w/o non-temporal prefetching shown):
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <immintrin.h>
#include <papi.h>

int main(int argc, char* argv[])
{
uint64_t* a;
const uint64_t a_bytes = 64 * 1024 * 1024;
const uint64_t a_count = a_bytes / sizeof(uint64_t);
posix_memalign((void**)(&a), 64, a_bytes);
uint64_t* b;
const uint64_t b_bytes = atol(argv[1]) * 1024;
const uint64_t b_count = b_bytes / sizeof(uint64_t);
posix_memalign((void**)(&b), 64, b_bytes);
__m256i ones = _mm256_set1_epi64x(1UL);
for (long i = 0; i < a_count; i += 4)
_mm256_stream_si256((__m256i*)(a + i), ones);
// load b into L1 cache
for (long i = 0; i < b_count; i++)
b[i] = 0;
int papi_events[1] = { PAPI_L1_DCM };
long long papi_values[1];
PAPI_start_counters(papi_events, 1);
uint64_t* a_ptr = a;
const uint64_t* a_ptr_end = a + a_count;
uint64_t* b_ptr = b;
const uint64_t* b_ptr_end = b + b_count;
while (a_ptr < a_ptr_end) {
#ifndef NTLOAD
__m256i aa = _mm256_load_si256((__m256i*)a_ptr);
#else
__m256i aa = _mm256_stream_load_si256((__m256i*)a_ptr);
#endif
__m256i bb = _mm256_load_si256((__m256i*)b_ptr);
bb = _mm256_add_epi64(aa, bb);
_mm256_store_si256((__m256i*)b_ptr, bb);
a_ptr += 4;
b_ptr += 4;
if (b_ptr >= b_ptr_end)
b_ptr = b;
}
PAPI_stop_counters(papi_values, 1);
std::cout << "L1 cache misses: " << papi_values[0] << std::endl;
free(a);
free(b);
}
What I wonder is whether CPU vendors support, or are going to support, non-temporal loads / prefetching or any other way to label some data as not to be kept in cache (e.g., to tag it as LRU). There are situations, e.g., in HPC, where similar scenarios are common in practice. For example, in sparse iterative linear solvers / eigensolvers, matrix data are usually very large (larger than cache capacities), but vectors are sometimes small enough to fit into L3 or even L2 cache. Then, we would like to keep them there at all costs. Unfortunately, loading the matrix data can evict cache lines of, in particular, the x vector, even though in each solver iteration the matrix elements are used only once and there is no reason to keep them in cache after they have been processed.
UPDATE
I just did a similar experiment on an Intel Xeon Phi KNC, measuring runtime instead of L1 misses (I haven't found a way to measure them reliably; PAPI and VTune gave weird metrics). The results are here:
The orange curve represents ordinary loads and has the expected shape. The blue curve represents loads with the so-called eviction hint (EH) set in the instruction prefix, and the gray curve represents a case where each cache line of a was manually evicted; both of these KNC-specific tricks obviously worked as we wanted for b over 16 kiB. The code of the measured loop is as follows:
while (a_ptr < a_ptr_end) {
#ifdef NTLOAD
__m512i aa = _mm512_extload_epi64((__m512i*)a_ptr,
_MM_UPCONV_EPI64_NONE, _MM_BROADCAST64_NONE, _MM_HINT_NT);
#else
__m512i aa = _mm512_load_epi64((__m512i*)a_ptr);
#endif
__m512i bb = _mm512_load_epi64((__m512i*)b_ptr);
bb = _mm512_or_epi64(aa, bb);
_mm512_store_epi64((__m512i*)b_ptr, bb);
#ifdef EVICT
_mm_clevict(a_ptr, _MM_HINT_T0);
#endif
a_ptr += 8;
b_ptr += 8;
if (b_ptr >= b_ptr_end)
b_ptr = b;
}
UPDATE 2
On Xeon Phi, icpc generated prefetching for a_ptr for the normal-load variant (orange curve):
400e93: 62 d1 78 08 18 4c 24 vprefetch0 [r12+0x80]
When I manually (by hex-editing the executable) modified this to:
400e93: 62 d1 78 08 18 44 24 vprefetchnta [r12+0x80]
I got the desired results, even better than the blue/gray curves. However, I was not able to force the compiler to generate non-temporal prefetching for me, even by using #pragma prefetch a_ptr:_MM_HINT_NTA before the loop :(
To answer specifically the headline question:
Yes, recent1 mainstream Intel CPUs support non-temporal loads on normal 2 memory - but only "indirectly" via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores where you can just use the corresponding non-temporal store instructions3 directly.
The basic idea is that you issue a prefetchnta to the cache line before any normal loads, and then issue the loads as normal. If the line wasn't already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of "non-temporal fashion" depends on the architecture, but the general pattern is that the line is loaded into at least the L1 and perhaps some higher cache levels. Indeed, for a prefetch to be of any use it needs to cause the line to be loaded into at least some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or restricting the ways in which it can be placed.
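As an illustration, applied to the benchmark loop in the question the pattern could look like this (a sketch only; it reuses the variables from that loop, and the prefetch distance of 64 elements, i.e. 512 bytes ahead, is a placeholder that would need tuning):
while (a_ptr < a_ptr_end) {
    // non-temporal prefetch a few cache lines ahead of the streaming a array
    _mm_prefetch((const char*)(a_ptr + 64), _MM_HINT_NTA);
    // ordinary loads then consume the (hopefully already prefetched) line
    __m256i aa = _mm256_load_si256((const __m256i*)a_ptr);
    __m256i bb = _mm256_load_si256((const __m256i*)b_ptr);
    bb = _mm256_add_epi64(aa, bb);
    _mm256_store_si256((__m256i*)b_ptr, bb);
    a_ptr += 4;
    b_ptr += 4;
    if (b_ptr >= b_ptr_end)
        b_ptr = b;
}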
The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores where you really leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture specific, and I've included some details below for modern Intel. You can find a slightly longer writeup in this answer to the question "Non-temporal loads and the hardware prefetcher, do they work together?".
Skylake Client
Based on the tests in this answer, it seems that the behavior of prefetchnta on Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and to fetch in a limited way into the L3 cache (probably into only 1 or 2 ways, so the total amount of L3 available to nta prefetches is limited).
This was tested on Skylake client, but I believe this basic behavior probably extends backwards to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.
Skylake Server
The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.
This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments, is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (especially easy when hyperthreading is involved and the sibling core is active) and the data gets evicted from L1 before you use it, you are going all the way back to main memory rather than to the L3 as on earlier architectures.
1 Recent here probably means anything in the last decade or so, but I don't mean to imply that earlier hardware didn't support non-temporal prefetch: it's possible that support goes right back to the introduction of prefetchnta but I don't have the hardware to check that and can't find an existing reliable source of information on it.
2 Normal here just means WB (writeback) memory, which is the memory you are dealing with at the application level the overwhelming majority of the time.
3 Specifically, the NT store instructions are movnti for general purpose registers and the movntd* and movntp* families for SIMD registers.
I am answering my own question since I found the following post from the Intel Developer Forum, which makes sense to me. It was written by John McCalpin:
The results for the mainstream processors are not surprising -- in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "set" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor completes reading it.
Both of these approaches risk performance degradation in cases operating on more than a small number of arrays, and are made much more difficult to implement without "gotchas" when HyperThreading is considered.
In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line would be copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically and that the time required to copy the remainder of the cache line to registers was so small (an extra 1-3 cycles, depending on the processor generation) that it could be safely implemented as an atomic operation.
Starting with Haswell, the core can read 64 Bytes in a single cycle (2 256-bit aligned AVX reads), so the exposure to unintended side effects becomes even lower.
Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that most of the time aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....
The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), and not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the SWDM says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.
The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions.
See here for the whole post: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075
I have the following C++ program, which uses no communication, and the same identical work is done on all cores. I know that this doesn't use parallel processing at all:
unsigned n = 130000000;
std::vector<double>vec1(n,1.0);
std::vector<double>vec2(n,1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++)
{
// Do something so it's not a trivial loop
vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;
I'm running this program on a single node with two Intel® Xeon® E5-2690 v3 processors, so I have 24 cores altogether. This is a dedicated node; no one else is using it.
Since there is no communication, and each processor is doing the same amount of (identical) work, running it on multiple processors should give the same time. However, I get the following times (averaged time over all cores):
1 core: 0.237
2 cores: 0.240
4 cores: 0.241
8 cores: 0.261
16 cores: 0.454
What could cause the increase in time? Particularly for 16 cores.
I have run callgrind and I get roughly the same amount of data/instruction misses on all cores (the percentage of misses is the same).
I have repeated the same test on a node with two Intel® Xeon® E5-2628L v2 processors (16 cores altogether), and I observe the same increase in execution times. Is this something to do with the MPI implementation?
Considering you are using ~2 GiB of memory per rank, your code is memory-bound. Except for the prefetchers, you are not operating within the cache but in main memory. You are simply hitting the memory bandwidth limit at a certain number of active cores.
Another aspect can be turbo mode, if enabled. Turbo mode can increase the core frequency to higher levels if fewer cores are utilized. As long as the memory bandwidth is not saturated, the higher turbo frequency will increase the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and frequency (Fig. 7/8).
Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times via any other means.
I suspect that there are shared resources used by your program, so when the number of processes increases, there are delays while a resource is freed so that it can be used by another process.
You see, you may have 24 cores, but that doesn't mean that your system allows every core to do everything concurrently. As mentioned in the comments, memory access is one thing that might cause delays (due to traffic); the same goes for disk.
Also consider the interconnection network, which can also suffer from many accesses. In conclusion, notice that these hardware delays are enough to overwhelm the processing time.
General note: Remember how Efficiency of a program is defined:
E = S/p, where S is the speedup and p the number of nodes/processes/threads
Now take scalability into account. Usually programs are weakly scalable, i.e. you have to increase the problem size at the same rate as p to keep efficiency constant. A program that keeps efficiency constant while increasing only p, with the problem size (n in your case) held constant, is strongly scalable.
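As a rough illustration with the numbers from the question (treating the 16 independent copies of the loop as the parallel workload): run serially they would take about 16 × 0.237 ≈ 3.79 s, but run together they take 0.454 s, so S ≈ 3.79 / 0.454 ≈ 8.4 and E = S/p ≈ 8.4 / 16 ≈ 0.52, i.e. roughly half of the per-core performance is lost to contention for shared resources such as memory bandwidth.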
Your program is not using parallel processing at all. Just because you have compiled it with OpenMP does not make it parallel.
To parallelize the for loop, for example, you need to use one of the #pragmas OpenMP offers.
unsigned n = 130000000;
std::vector<double>vec1(n,1.0);
std::vector<double>vec2(n,1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
#pragma omp parallel for
for (unsigned i = 0; i < n; i++)
{
// Do something so it's not a trivial loop
vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;
However, take into account that for large values of n, the impact of cache misses may hide the performance gained with multiple cores.
An open source C++/Qt app I'm interested in depends on CUDA. My MacBook Pro (mid-2014) has the stock Intel Iris Pro and no NVIDIA graphics card. Naturally, the pre-built app won't run.
I found this emulator: https://github.com/gtcasl/gpuocelot - but it's only tested against Linux, and there are several open issues about it not compiling on the Mac.
I have the source - can I replace the CUDA dependency with c++ equivalents, at the cost of slower processing? I'm hoping for something like
rename file extensions: .cu to .cpp
remove CUDA references from make file
replace CUDA headers with equivalent c++ std lib headers
adjust makefile, adding missing library references as needed
fix remaining missing function calls (hopefully only one or two) with c++ code (possibly cribbed from Ocelot)
But I'm afraid it's not that simple. I'd like a sanity check before I begin.
In the general case, I don't think there is a specific roadmap to "de-CUDA-fy" an application. Just as I don't think there is a specific "mechanical" roadmap to "CUDA-fy" an application, nor do I find specific roadmaps for programming problems in general.
Furthermore, I think the proposed roadmap has flaws. To pick just one example, a .cu file will normally have CUDA-specific references that will not be tolerated by an ordinary c++ compiler used to compile a .cpp code. Some of these references may be items that depend on the CUDA runtime API, such as cudaMalloc and cudaMemcpy, and although these could be made to pass through an ordinary c++ compiler (they are just library calls), it would not be sensible to leave them in place for an application that has the CUDA character removed. Furthermore, some of the references may be CUDA-specific language features such as declaration of device code via __global__ or __device__ or launching of a device "kernel" function with its corresponding syntax <<<...>>>. These cannot be made to pass through an ordinary c++ compiler, and would have to be dealt with specifically. Furthermore, simply deleting those CUDA keywords and syntax would be very unlikely to produce useful results.
In short, the code would have to be refactored; there is no reasonably concise roadmap that explains a more-or-less mechanical process to do so. I suggest the complexity of the refactoring process would be approximately the same complexity as the original process (if there was one) to convert a non-CUDA version of the code to a CUDA version. At a minimum, some non-mechanical knowledge of CUDA programming would be required in order to understand the CUDA constructs.
For very simple CUDA codes, it might be possible to lay out a somewhat mechanical process to de-CUDA-fy the code. To recap, the basic CUDA processing sequence is as follows:
allocate space for data on the device (perhaps with cudaMalloc) and copy data to the device (perhaps with cudaMemcpy)
launch a function that runs on the device (a __global__ or "kernel" function) to process the data and create results
copy results back from the device (perhaps, again, with cudaMemcpy)
Therefore, a straightforward approach would be to:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
Since CUDA is a parallel processing architecture, one approach to convert an inherently parallel CUDA "kernel" code to ordinary c++ code (step 2 above) would be to use a loop or a set of loops. But beyond that the roadmap tends to get quite divergent, depending on what the code is actually doing. In addition, inter-thread communication, non-transformational algorithms (such as reductions), and use of CUDA intrinsics or other language specific features will considerably complicate step 2.
For example lets take a very simple vector ADD code. The CUDA kernel code for this would be distinguished by a number of characteristics that would make it easy to convert to or from a CUDA realization:
There is no inter-thread communication. The problem is "embarrassingly parallel". The work done by each thread is independent of all other threads. This describes only a limited subset of CUDA codes.
There is no need or use of any CUDA specific language features or intrinsics (other than a globally unique thread index variable), so the kernel code is recognizable as almost completely valid c++ code already. Again, this characteristic probably describes only a limited subset of CUDA codes.
So the CUDA version of the vector add code might look like this (drastically simplified for presentation purposes):
#include <stdio.h>
#define N 512
// perform c = a + b vector add
__global__ void vector_add(const float *a, const float *b, float *c){
int idx = threadIdx.x;
c[idx]=a[idx]+b[idx];
}
int main(){
float a[N] = {1};
float b[N] = {2};
float c[N] = {0};
float *d_a, *d_b, *d_c;
int dsize = N*sizeof(float);
cudaMalloc(&d_a, dsize); // step 1 of CUDA processing sequence
cudaMalloc(&d_b, dsize);
cudaMalloc(&d_c, dsize);
cudaMemcpy(d_a, a, dsize, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, dsize, cudaMemcpyHostToDevice);
vector_add<<<1,N>>>(d_a, d_b, d_c); // step 2
cudaMemcpy(c, d_c, dsize, cudaMemcpyDeviceToHost); // step 3
for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
printf("Success!\n");
return 0;
}
We see that the above code follows the typical CUDA processing sequence 1-2-3 and the beginning of each step is marked in the comments. So our "de-CUDA-fy" roadmap is, again:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
For step 1, we will literally just delete the cudaMalloc and cudaMemcpy lines, and we will instead plan to operate directly on the a[], b[] and c[] variables in the host code. The remaining step, then, is to convert the vector_add CUDA "kernel" function to an ordinary c++ function. Again, some knowledge of CUDA fundamentals is necessary to understand the extent of the operation being performed in parallel. But the kernel code itself (other than the use of the threadIdx.x built-in CUDA variable) is completely valid c++ code, and there is no inter-thread communication or other complicating factors. So an ordinary c++ realization could just be the kernel code, placed into a suitable for-loop iterating over the parallel extent (N in this case), and placed into a comparable c++ function:
void vector_add(const float *a, const float *b, float *c){
for (int idx=0; idx < N; idx++)
c[idx]=a[idx]+b[idx];
}
Combining the above steps, we need to (in this trivial example):
delete the cudaMalloc and cudaMemcpy operations
replace the cuda kernel code with a similar, ordinary c++ function
fixup the kernel invocation in main to be an ordinary c++ function call
Which gives us:
#include <stdio.h>
#define N 512
// perform c = a + b vector add
void vector_add(const float *a, const float *b, float *c){
for (int idx = 0; idx < N; idx++)
c[idx]=a[idx]+b[idx];
}
int main(){
float a[N] = {1};
float b[N] = {2};
float c[N] = {0};
vector_add(a, b, c);
for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
printf("Success!\n");
return 0;
}
The point of working through this example is not to suggest the process will be in general this trivially simple. But hopefully it is evident that the process is not a purely mechanical one, but depends on some knowledge of CUDA and also requires some actual code refactoring; it cannot be done simply by changing file extensions and modifying a few function calls.
A few other comments:
Many laptops are available which have CUDA-capable (i.e. NVIDIA) GPUs in them. If you have one of these (I realize you don't but I'm including this for others who may read this), you can probably run CUDA codes on it.
If you have an available desktop PC, it's likely that for less than $100 you could add a CUDA-capable GPU to it.
Trying to leverage emulation technology IMO is not the way to go here, unless you can use it in a turnkey fashion. Cobbling bits and pieces from an emulator into an application of your own is a non-trivial exercise, in my opinion.
I believe in the general case, conversion of a CUDA code to a corresponding OpenCL code will not be trivial either. (The motivation here is that there is a lot of similarity between CUDA and OpenCL, and an OpenCL code could probably be made to run on your laptop, as OpenCL codes can usually be run on a variety of targets, including CPUs and GPUs). There are enough differences between the two technologies that it requires some effort, and this brings the added burden of requiring some level of familiarity with both OpenCL and CUDA, and the thrust of your question seems to be wanting to avoid those learning curves.
Simple question:
Is it possible to compute or get the best pitch for an array without allocating memory as in
cudaMallocPitch(void** p, size_t *pitch, size_t width, size_t height)
I would like to get the pitch, without allocating the memory and then use the function cudaMalloc instead!
(this is crucial if one wants to implement some caching allocator for pitched allocations for the cuda platform)
Is it:
// round width to the next multiple of prop.textureAlignment
size_t proper_pitch = ((width / (size_t)device.m_prob.textureAlignment) + 1) * device.m_prob.textureAlignment;
Update:
I now calculate proper_pitch as the smallest multiple of 32/64/128 bytes that is at least the width in bytes:
I have not tried this, and I still don't know what else the runtime API could do; maybe it looks at the already allocated memory and does some fitting? In the CUDA Programming Guide, this alignment is a necessary requirement for fully coalesced access (not sufficient, since at runtime warps need to access contiguous addresses)...
// use the CUDA Programming Guide alignment (which should be the best, I think)
// round up to the closest multiple of 32/64/128
//size_t upperMultOf32 = ((widthInBytes + 32 - 1)/32)*32; // ((widthInBytes-1)/32 + 1)*32
proper_pitch = std::min(
std::min( ((widthInBytes + 32 - 1)>>5)<<5 , ((widthInBytes + 64 - 1)>>6)<<6 ),
((widthInBytes + 128 - 1)>>7)<<7
);
At present there is no way of obtaining the pitch calculation. The details are probably hardware version specific, and NVIDIA have neither documented the calculations, nor exposed the calculations via an API (although, as pointed out it would be trivial for them to do so).
If this is a serious limitation for a real world use-case, I would recommend raising a bug report/feature request via the NVIDIA registered developer's portal. In my experience, they do listen to serious feature requests.
[This answer was assembled mostly from comments and added as a community wiki entry to get this question off the unanswered list]