An open source C++/Qt app I'm interested in depends on CUDA. My macbook pro (mid 2014) has the stock Intel Iris Pro, and no NVidia graphics card. Naturally, the pre-built app won't run.
I found this emulator: https://github.com/gtcasl/gpuocelot - but it's only tested against Linux, and there are several open issues about it not compiling on the Mac.
I have the source - can I replace the CUDA dependency with c++ equivalents, at the cost of slower processing? I'm hoping for something like
rename file extensions: .cu to .cpp
remove CUDA references from the makefile
replace CUDA headers with equivalent c++ std lib headers
adjust makefile, adding missing library references as needed
fix remaining missing function calls (hopefully only one or two) with c++ code (possibly cribbed from Ocelot)
But I'm afraid it's not that simple. I'd like a sanity check before I begin.
In the general case, I don't think there is a specific roadmap to "de-CUDA-fy" an application, just as I don't think there is a specific "mechanical" roadmap to "CUDA-fy" an application, nor do I find specific roadmaps for programming problems in general.
Furthermore, I think the proposed roadmap has flaws. To pick just one example, a .cu file will normally have CUDA-specific references that will not be tolerated by an ordinary c++ compiler used to compile .cpp code. Some of these references may be items that depend on the CUDA runtime API, such as cudaMalloc and cudaMemcpy; although these could be made to pass through an ordinary c++ compiler (they are just library calls), it would not be sensible to leave them in place for an application that has the CUDA character removed. Other references may be CUDA-specific language features, such as declaration of device code via __global__ or __device__, or launching of a device "kernel" function with its corresponding <<<...>>> syntax. These cannot be made to pass through an ordinary c++ compiler and would have to be dealt with specifically. Simply deleting those CUDA keywords and syntax would be very unlikely to produce useful results.
In short, the code would have to be refactored; there is no reasonably concise roadmap that explains a more-or-less mechanical process to do so. I suggest the complexity of the refactoring process would be approximately the same complexity as the original process (if there was one) to convert a non-CUDA version of the code to a CUDA version. At a minimum, some non-mechanical knowledge of CUDA programming would be required in order to understand the CUDA constructs.
For very simple CUDA codes, it might be possible to lay out a somewhat mechanical process to de-CUDA-fy the code. To recap, the basic CUDA processing sequence is as follows:
allocate space for data on the device (perhaps with cudaMalloc) and copy data to the device (perhaps with cudaMemcpy)
launch a function that runs on the device (a __global__ or "kernel" function) to process the data and create results
copy results back from the device (perhaps, again, with cudaMemcpy)
Therefore, a straightforward approach would be to:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
Since CUDA is a parallel processing architecture, one approach to convert an inherently parallel CUDA "kernel" code to ordinary c++ code (step 2 above) would be to use a loop or a set of loops. But beyond that the roadmap tends to get quite divergent, depending on what the code is actually doing. In addition, inter-thread communication, non-transformational algorithms (such as reductions), and use of CUDA intrinsics or other language specific features will considerably complicate step 2.
For example, let's take a very simple vector add code. The CUDA kernel code for this would be distinguished by a number of characteristics that would make it easy to convert to or from a CUDA realization:
There is no inter-thread communication. The problem is "embarrassingly parallel". The work done by each thread is independent of all other threads. This describes only a limited subset of CUDA codes.
There is no need or use of any CUDA specific language features or intrinsics (other than a globally unique thread index variable), so the kernel code is recognizable as almost completely valid c++ code already. Again, this characteristic probably describes only a limited subset of CUDA codes.
So the CUDA version of the vector add code might look like this (drastically simplified for presentation purposes):
#include <stdio.h>
#define N 512
// perform c = a + b vector add
__global__ void vector_add(const float *a, const float *b, float *c){
  int idx = threadIdx.x;
  c[idx] = a[idx] + b[idx];
}

int main(){
  float a[N] = {1};
  float b[N] = {2};
  float c[N] = {0};
  float *d_a, *d_b, *d_c;
  int dsize = N*sizeof(float);

  cudaMalloc(&d_a, dsize); // step 1 of CUDA processing sequence
  cudaMalloc(&d_b, dsize);
  cudaMalloc(&d_c, dsize);
  cudaMemcpy(d_a, a, dsize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, dsize, cudaMemcpyHostToDevice);

  vector_add<<<1,N>>>(d_a, d_b, d_c); // step 2

  cudaMemcpy(c, d_c, dsize, cudaMemcpyDeviceToHost); // step 3

  for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
  printf("Success!\n");
  return 0;
}
We see that the above code follows the typical CUDA processing sequence 1-2-3 and the beginning of each step is marked in the comments. So our "de-CUDA-fy" roadmap is, again:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
For step 1, we will literally just delete the cudaMalloc and cudaMemcpy lines, and we will instead plan to operate directly on the a[], b[] and c[] variables in the host code. The remaining step, then, is to convert the vector_add CUDA "kernel" function to an ordinary c++ function. Again, some knowledge of CUDA fundamentals is necessary to understand the extent of the operation being performed in parallel. But the kernel code itself (other than the use of the threadIdx.x built-in CUDA variable) is completely valid c++ code, and there is no inter-thread communication or other complicating factors. So an ordinary c++ realization could just be the kernel code, placed into a suitable for-loop iterating over the parallel extent (N in this case), and placed into a comparable c++ function:
void vector_add(const float *a, const float *b, float *c){
  for (int idx = 0; idx < N; idx++)
    c[idx] = a[idx] + b[idx];
}
Combining the above steps, we need to (in this trivial example):
delete the cudaMalloc and cudaMemcpy operations
replace the cuda kernel code with a similar, ordinary c++ function
fixup the kernel invocation in main to be an ordinary c++ function call
Which gives us:
#include <stdio.h>
#define N 512
// perform c = a + b vector add
void vector_add(const float *a, const float *b, float *c){
  for (int idx = 0; idx < N; idx++)
    c[idx] = a[idx] + b[idx];
}

int main(){
  float a[N] = {1};
  float b[N] = {2};
  float c[N] = {0};

  vector_add(a, b, c);

  for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
  printf("Success!\n");
  return 0;
}
The point of working through this example is not to suggest the process will be in general this trivially simple. But hopefully it is evident that the process is not a purely mechanical one, but depends on some knowledge of CUDA and also requires some actual code refactoring; it cannot be done simply by changing file extensions and modifying a few function calls.
A few other comments:
Many laptops are available which have CUDA-capable (i.e. NVIDIA) GPUs in them. If you have one of these (I realize you don't but I'm including this for others who may read this), you can probably run CUDA codes on it.
If you have an available desktop PC, it's likely that for less than $100 you could add a CUDA-capable GPU to it.
Trying to leverage emulation technology IMO is not the way to go here, unless you can use it in a turnkey fashion. Cobbling bits and pieces from an emulator into an application of your own is a non-trivial exercise, in my opinion.
I believe that, in the general case, conversion of a CUDA code to a corresponding OpenCL code will not be trivial either. (The motivation here is that there is a lot of similarity between CUDA and OpenCL, and an OpenCL code could probably be made to run on your laptop, as OpenCL codes can usually be run on a variety of targets, including CPUs and GPUs.) There are enough differences between the two technologies that it requires some effort, and it brings the added burden of requiring some level of familiarity with both OpenCL and CUDA; the thrust of your question seems to be a desire to avoid those learning curves.
Related
I've been contributing to an OpenCL program called mfakto that trial factors Mersenne numbers for GIMPS. It uses a modified Sieve of Eratosthenes to create a list of potential factors and then uses modular exponentiation to test the factors. The sieving step can be done on either the GPU or CPU while the modular exponentiation step is done only on the target device.
The program uses different kernels depending on the number's size. I'm able to initialize the 15-bit kernels without issues. However, clEnqueueNDRangeKernel() throws a CL_INVALID_KERNEL_ARGS error for 32-bit kernels even though I've set all the arguments. Here's a sample 32-bit kernel:
__kernel void cl_barrett32_76(__private uint exponent, const int96_t k_base, const __global uint * restrict k_tab, const int shiftcount,
#ifdef WA_FOR_CATALYST11_10_BUG
const uint8 b_in,
#else
const __private int192_t bb,
#endif
__global uint * restrict RES, const int bit_max65
MODBASECASE_PAR_DEF )
{
...
}
In normal circumstances, the kernel takes eight arguments. MODBASECASE_PAR_DEF adds a ninth argument that is only used when the application is compiled with certain debug flags. I traced the code and verified that clSetKernelArg() is used to set each argument at least once.
The issue only occurs when the OpenCL code is run on the CPU on macOS. It does not come up when the program is run on an AMD GPU or on any device on Windows.
Apple's OpenCL implementation only supports a kernel work-group size of 128 on the CPU, but I've already added checks to prevent clEnqueueNDRangeKernel() from trying to use more local threads than the kernel allows.
I realize this is a very specific problem in a very complex application, but any advice would be appreciated.
I suspect the issue might be related to the int96_t and int192_t types you are using. Are those typedefs for structs? The OpenCL standard is not clear on the use of non-primitive, non-buffer types as kernel arguments, so you will often find this working for some implementations but failing for others.
I suggest putting the data for these arguments in OpenCL buffers, and declaring them in the kernel function signature as constant int96_t* or global int96_t*, or similar. You'll obviously need to dereference the pointer in the kernel, either on every use of the value or by copying it into a private variable once.
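As a rough illustration (hypothetical variable names and argument index; the real mfakto argument list differs), the host side would wrap the struct in a small read-only buffer and bind that instead of the by-value struct, with the kernel parameter changed to a pointer that is dereferenced once into a private copy:

cl_int err;
cl_mem k_base_buf = clCreateBuffer(context,
                                   CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int96_t), &k_base, &err);
// bind the buffer handle instead of the raw struct:
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &k_base_buf);
// kernel side (sketch):
//   __kernel void cl_barrett32_76(..., __constant int96_t * restrict k_base_p, ...)
//   {
//       int96_t k_base = *k_base_p;   // private copy, used as before
//       ...
//   }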
I was testing the new CUDA 8 along with the Pascal Titan X GPU and was expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.
Here is the minimum code that can reproduce the result:
CUDASample.cuh
#include <vector>

class CUDASample{
public:
  void AddOneToVector(std::vector<int> &in);
};
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}
Main.cpp
std::vector<int> v;
for (int i = 0; i < 8192000; i++){
v.push_back(i);
}
CUDASample cudasample;
cudasample.AddOneToVector(v);
The only difference is the NVCC flag, which for the Pascal Titan X is:
-gencode arch=compute_61,code=sm_61 -std=c++11
and for the old Maxwell Titan X is:
-gencode arch=compute_52,code=sm_52 -std=c++11
EDIT: Here are the results for running NVIDIA Visual Profiling.
For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.
For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.
I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
  cudaMemcpy(reinterpret_cast<void*>(data), reinterpret_cast<void*>(in.data()),
             in.size() * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  cudaMemcpy(reinterpret_cast<void*>(in.data()), reinterpret_cast<void*>(data),
             in.size() * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(data);
}
For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.
For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...
Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It will be a travesty if I have to revert all my existing code that uses cudaMallocManaged back to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using this results in a slow run time even though the code is easier, it would be unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?
Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about updated CUDA 9 behavior for Windows.)
With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.
With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:
Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
Cause that page of memory to be migrated from the CPU to the GPU
This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).
Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.
This particular code example, however, does not benefit. This was anticipated, and so the recommended approach to bring the behavior in line with previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.
You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.
With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
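A minimal sketch of that prefetch, adapted to the AddOneToVector code in the question (default stream assumed, error checking omitted):

int device = -1;
cudaGetDevice(&device);
// migrate the managed allocation to the GPU before the launch
cudaMemPrefetchAsync(data, in.size() * sizeof(int), device, NULL);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);  // same (default) stream, so the prefetch completes first
cudaDeviceSynchronize();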
As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in it's latency-hiding architecture, if there is sufficient other available "work" to attend to.
As an additional note, in CUDA 9, the demand-paging regime for Pascal and beyond is only available on Linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On Windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as for Maxwell and prior devices: data is migrated to the GPU en masse at kernel launch.
(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.
Additional information is available in this blog article.
I can reproduce this in three programs on a 1060 and a 1080. As an example, I use a volume renderer with a procedural transfer function, which was nearly interactive real-time on a 960 but is a slide show on a 1080. All data are stored in read-only textures, and only my transfer functions are in managed memory. In contrast to my other code, the volume renderer runs especially slowly; this is because its transfer functions are passed from the kernel to other device functions.
I believe it is not only kernel calls with cudaMallocManaged data. My experience suggests that every call of a kernel or device function shows this behavior, and the effect adds up. The volume renderer is also partly based on the provided CudaSample without managed memory, which runs as expected on Maxwell and Pascal GPUs (1080, 1060, 980 Ti, 980, 960).
I only found this bug yesterday, because we switched all of our research systems to Pascal. In the next few days I will profile my software on a 980 compared to a 1080. I'm not yet sure whether I should report a bug in the NVIDIA developer zone.
It is an NVIDIA bug on Windows systems which occurs with the Pascal architecture.
I have known this for a few days, but could not post it here because I was on vacation without an internet connection.
For details see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the bug. It should be corrected with CUDA 9. He also says that it should be communicated to Microsoft to help the cause, but I have not found a suitable Microsoft bug report page so far.
nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when? : warpSize, or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, however that changed at least 6 years ago if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile-time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
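For the third item, a minimal host-side query might look like this (a sketch, error checking omitted):

int warp_size = 0;
cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, 0 /* device id */);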
Don't declare your own constant for the warp size, that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instance at runtime (see the sketch below). The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
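As a rough sketch of that template approach (hypothetical kernel, error checking omitted), the warp size becomes a template parameter and the instantiation is chosen from the value queried at runtime:

template <int WARP_SIZE>
__global__ void my_kernel(float *out)
{
    __shared__ float buf[4 * WARP_SIZE];              // statically dimensioned from the template parameter
    buf[threadIdx.x % (4 * WARP_SIZE)] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x % WARP_SIZE];
}

void launch(float *out)
{
    int warp_size = 0;
    cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, 0);
    switch (warp_size) {
        case 32: my_kernel<32><<<1, 128>>>(out); break;  // only instance needed today
        default: /* unsupported warp size */     break;
    }
}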
Contrary to talonmies's answer, I find the warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot, and while int does not need to be 32-bit, it most commonly is!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
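For comparison, this is the shift-and-mask form that the hard-coded 32 reduces to (equivalent to the division/modulo above):

warpIdx = threadIdx.x >> 5;   // threadIdx.x / 32
laneIdx = threadIdx.x & 31;   // threadIdx.x % 32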
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for newer architecture, you just need to alter this one piece of code.
I would like to create an OpenCL kernel without giving access to it to the end user.
Therefore, I can't use a regular external .cl text file. What are the alternatives, given that I would like to avoid creating a huge text string with the kernel?
And yet another question: if I put this code in a hardcoded string, won't it be possible to access that code from some disassembler?
Here you have 2 scenarios:
If you are targeting one single device
If you are targeting any OpenCL device
In the first scenario, there is a possibility to embed the binary data into your executable (as a byte string) and load it when you run the program.
There would be no reverse engineering of the source possible (beyond the already-known techniques, such as disassembly), since the program will contain compiled code and not the original code you wrote.
The way of doing that would be:
uchar binary_dev1[binarySize] = "...";
uchar *binary = binary_dev1;
program = clCreateProgramWithBinary(context, 1, &device,
                                    &binarySize,
                                    (const unsigned char **)&binary,
                                    &binaryStatus,
                                    &errNum);
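To obtain that binary in the first place, one approach (a sketch; single device, error checking omitted) is to build the program from source once on the target device and dump the device binary with clGetProgramInfo, then embed the resulting bytes in your executable:

size_t binarySize = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &binarySize, NULL);
unsigned char *binary = (unsigned char *)malloc(binarySize);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &binary, NULL);
// write `binary` out to a file (or a generated header) and ship it as the byte array above
free(binary);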
The second alternative involves protecting the source code in the kernel by some sort of "mangling".
Since the mangler code is going to be compiled, reverse engineering it could be complicated.
You can do any mangling you can think of that is reversible, and even combine them. Some ideas:
Compress the code using a compression format, but hardcode some parameters of the decompression to make it less straightforward.
LZ4, ZLIB, etc...
Use an XOR operator on the code. Better if it varies over time, and better if it varies using a non-obvious rule.
For example:
char seq = 0x1A;
for(int i=0; i<len; i++){
    out[i] = in[i] ^ seq;
    seq = (((seq ^ i) * 78965213) >> 4) + (((seq * i) * 56987) << 4);
}
Encode it using encoding methods that require a key, and are reversible
Use a program that protects your program binary towards reverse engineering, like Themida.
Use SPIR 1.2 for OpenCL 1.2 or SPIR 2.0 for OpenCL 2.0 until SPIR-V for OpenCL 2.1 is available.
Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?
IE, I want to achieve something like:
canvas += addon;
So if canvas[0][0] =3 and addon[0][0] = 2 then canvas[0][0] = 5
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);       // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit);     // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp);        // SSE4 *
    return cmp == 0;
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
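For illustration, one possible SSE2-only substitution for the two starred operations (a sketch, untested):

// per-lane signed max using SSE2 compare/select
static inline __m128i sse2_max_epi32(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b);            // 0xFFFFFFFF where a > b
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));  // pick a or b per lane
}
// ...and for the final test, replace _mm_testz_si128 with:
//     cmp = (_mm_movemask_epi8(vcmp) == 0);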
You can't do anything faster than loops in just C++. You would need to use some platform-specific vector instructions. That is, you would need to go down to the assembly language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of doing the low-level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.
The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)
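For illustration, a minimal sketch of that flattening (using the canvas/addon arrays from the question):

int *c = &canvas[0][0];
const int *a = &addon[0][0];
bool over = false;
for (int i = 0; i < 10 * 10; ++i) {
    c[i] += a[i];
    over = over || (c[i] > 100);   // track the "exceeds 100" check in the same pass
}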
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
You would need to get into lower-level processing in order to speed that up. There are additional instructions on many modern CPUs to do such processing that you might be able to use. You might be able to run something like this on a GPU using something like CUDA. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into hundreds of billions of operations on numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.
Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
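As a rough sketch of that clamping idea with SSE4.1 intrinsics (canvas1d/addon1d are the flattened [100] views described above; assumes 16-byte aligned data):

#include <smmintrin.h>
const __m128i vlimit = _mm_set1_epi32(100);
for (int i = 0; i < 100; i += 4) {
    __m128i v = _mm_load_si128((__m128i *)&canvas1d[i]);
    v = _mm_add_epi32(v, _mm_load_si128((const __m128i *)&addon1d[i]));
    v = _mm_min_epi32(v, vlimit);                 // clamp each lane to 100
    _mm_store_si128((__m128i *)&canvas1d[i], v);
}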
Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from an int to a uint8_t. Then, you could add 4 elements at a time using uint32_t without worrying about overflow (since each per-byte sum is at most 200, no carry can spill into the neighboring byte).
That is ...
uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];
int i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}
It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.
You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA capable hardware, and CUDA takes a bit of effort to get setup in your development environment, so it would depend how important this really is!
Good luck!