How to implement a compression table in CUDA? - c++

I'm trying to optimize my C++ code, and I don't know if there is a way to store a table on the GPU with CUDA C. The current C++ code for the table is:
double m_alpha = 0.5;
unsigned char* compressionTable = new unsigned char[65536];
double denom = exp(m_alpha * log(65535.0)) / 255.0;
for (unsigned int i = 0; i < 65536; ++i)
    compressionTable[i] = exp(m_alpha * log(i)) / denom;
Afterwards I access this table in a loop as:
bmode[i][j] = compressionTable[round(abs(sH[i][j]))];
sH is the Hilbert transform (a complex array) obtained from an array of short int data (the compression table occupies 2^16 bytes). The loop for the access is not a trivial problem, but my main question is the fast implementation of the compressionTable. I would appreciate any help.

If you really need to use a lookup table, on a GPU with SM 2.0 or higher, you should just put it in device memory and let the caches handle the memory traffic. For lookup tables, the other memory spaces don't work any better than L1/L2.
But this looks like a case where an optimization that works well on CPUs is not needed at all on GPUs. CUDA hardware can compute single precision logarithms and exponentials with a latency of just 4 clock cycles. Rewrite your algorithm to do the computation in-line instead of using a lookup table. The resulting code will have less data-dependent performance, and the memory subsystem will be freed up to service memory traffic that's actually needed to run the kernel.
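As a minimal sketch of what the in-line computation might look like (the kernel shape, the hypotf-based magnitude, and all names except bmode, sH, and the alpha/denom values are my assumptions, not code from the question; note that exp(alpha * log(x)) is simply pow(x, alpha)):

#include <cuComplex.h>

// Sketch: compute the compression in-line instead of via a lookup table.
// A flat 1D launch over n output samples is assumed.
__global__ void compressKernel(unsigned char *bmode, const cuComplex *sH,
                               int n, float alpha, float denom)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float mag = hypotf(cuCrealf(sH[i]), cuCimagf(sH[i]));   // |sH[i]|
        float v = __powf(mag, alpha) / denom;   // fast intrinsic for exp(alpha*log(x))
        bmode[i] = (unsigned char)fminf(v, 255.0f);
    }
}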

Related

How to do the modulus of complex number more efficiently in CUDA?

I was trying to apply the Fast Fourier Transform to the data I collected. After the FFT operation, I wanted to calculate the modulus of the cufftComplex data. Therefore, I summed the square of the real part and the square of the imaginary part, and then took the square root of the sum. The code is provided below, along with the assignment of the grids and blocks:
dim3 dimBlock(256);
dim3 dimGrid(FFTlength / 256 * lines);

__global__ void modulus_kernel(int length, int lines, cufftComplex *PostFFTData, float *z)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < length * lines)
        z[x] = sqrt(PostFFTData[x].x * PostFFTData[x].x + PostFFTData[x].y * PostFFTData[x].y);
    __syncthreads();
}
The length of the PostFFTData array is 1024000, and length and lines are 2048 and 500, respectively.
After I executed the code, I analyzed the timeline of the program with the NVIDIA Visual Profiler.
It shows that the modulus kernel took 0.367 ms to complete. The GPU card I used is a GTX 1080 and the CPU is an i7-7700U. If I want to shorten the execution time, how should I do it?
If I want to shorten the execution time, how should I do it?
I can think of at least five things, in no particular order (a sketch combining several of them follows the list):
Get rid of the __syncthreads() call. It is unnecessary and will actively slow down your code.
Pass length*lines to the kernel as a single precomputed argument. Why have every thread do an integer multiply for a value which is constant?
Use a grid-stride loop and launch only as many threads as can be resident on the device. Use the occupancy APIs to let the runtime do the hard thinking about the launch parameters for you.
If the problem size allows, use #pragma unroll with a suggested unrolling length to hint to the compiler that the grid-stride loop can be partially unrolled. If that doesn't allow the compiler to generate a stream of floating point operations, then partially unroll the grid-stride loop yourself.
Because you are passing single precision floating point values, use sqrtf, not sqrt. There are significant performance differences between double and single precision functions. If your application allows it, consider using less accurate versions of the sqrt function (-prec-sqrt=false).
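A minimal sketch combining points 1-3 and 5 (the occupancy-based launch is my assumption about how you would wire it up, not code from the question):

// Grid-stride version: no __syncthreads(), length*lines precomputed as n,
// single-precision sqrtf throughout.
__global__ void modulus_kernel(int n, const cufftComplex *PostFFTData, float *z)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        float re = PostFFTData[i].x;
        float im = PostFFTData[i].y;
        z[i] = sqrtf(re * re + im * im);
    }
}

// Launch: let the occupancy API pick the launch parameters.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, modulus_kernel, 0, 0);
modulus_kernel<<<minGridSize, blockSize>>>(length * lines, PostFFTData, z);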
The __syncthreads() call is useless, since there is no sharing between threads.

Why does my GTX 1080ti behave slower than GT 750M? [duplicate]

I was testing the new CUDA 8 along with the Pascal Titan X GPU and was expecting a speed-up for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.
Here is the minimum code that can reproduce the result:
CUDASample.cuh
#include <vector>

class CUDASample {
public:
    void AddOneToVector(std::vector<int> &in);
};
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int mx = gridDim.x * blockDim.x;
    data[y * mx + x] = data[y * mx + x] + 1;
}

void CUDASample::AddOneToVector(std::vector<int> &in)
{
    int *data;
    cudaMallocManaged(reinterpret_cast<void **>(&data),
                      in.size() * sizeof(int),
                      cudaMemAttachGlobal);
    for (std::size_t i = 0; i < in.size(); i++) {
        data[i] = in.at(i);
    }
    dim3 blks(in.size() / (16 * 32), 1);
    dim3 threads(32, 16);
    CUDAKernelAddOneToVector<<<blks, threads>>>(data);
    cudaDeviceSynchronize();
    for (std::size_t i = 0; i < in.size(); i++) {
        in.at(i) = data[i];
    }
    cudaFree(data);
}
Main.cpp
std::vector<int> v;
for (int i = 0; i < 8192000; i++) {
    v.push_back(i);
}
CUDASample cudasample;
cudasample.AddOneToVector(v);
The only difference is the NVCC flag, which for the Pascal Titan X is:
-gencode arch=compute_61,code=sm_61 -std=c++11;
and for the old Maxwell Titan X is:
-gencode arch=compute_52,code=sm_52 -std=c++11;
EDIT: Here are the results for running NVIDIA Visual Profiling.
For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.
For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.
I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int mx = gridDim.x * blockDim.x;
    data[y * mx + x] = data[y * mx + x] + 1;
}

void CUDASample::AddOneToVector(std::vector<int> &in)
{
    int *data;
    cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
    cudaMemcpy(reinterpret_cast<void *>(data), reinterpret_cast<void *>(in.data()),
               in.size() * sizeof(int), cudaMemcpyHostToDevice);
    dim3 blks(in.size() / (16 * 32), 1);
    dim3 threads(32, 16);
    CUDAKernelAddOneToVector<<<blks, threads>>>(data);
    cudaDeviceSynchronize();
    cudaMemcpy(reinterpret_cast<void *>(in.data()), reinterpret_cast<void *>(data),
               in.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data);
}
For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.
For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...
Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It will be a travesty if I have to revert all my existing code from cudaMallocManaged to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using this results in a slow run time even though the code is easier, that should be unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?
Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about updated CUDA 9 behavior on Windows.)
With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.
With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:
Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
Cause that page of memory to be migrated from the CPU to the GPU
This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).
Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.
This particular code example, however, does not benefit. This was anticipated, and so the recommended way to bring the behavior in line with previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.
You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.
With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
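For example, a sketch of how the prefetch might be inserted into the managed-memory version above (data, in, blks and threads are the question's names; the device query and the default stream are my assumptions):

// Prefetch the managed allocation to the GPU so the kernel does not
// page-fault its way through the data.
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(data, in.size() * sizeof(int), device, NULL);  // default stream

CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();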
As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in its latency-hiding architecture, if there is sufficient other available "work" to attend to.
As an additional note, in CUDA 9, the demand-paging regime for Pascal and beyond is only available on Linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On Windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as on Maxwell and prior devices; data is migrated to the GPU en masse, at kernel launch.
(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.
Additional information is available in this blog article.
I can reproduce this in three programs on a 1060 and a 1080. As an example, I use a volume renderer with a procedural transfer function, which was nearly interactive real time on a 960 but is a slide show on a 1080. All data are stored in read-only textures, and only my transfer functions are in managed memory. In contrast to my other code, the volume renderer runs especially slowly; this is because, unlike my other code, its transfer functions are passed from the kernel to other device methods.
I believe that it is not only the calling of kernels with cudaMallocManaged data. My experience suggests that every call of a kernel or device method has this behavior, and the effect adds up. The basis of the volume renderer is in part the provided CUDASample without managed memory, which runs as expected on Maxwell and Pascal GPUs (1080, 1060, 980 Ti, 980, 960).
I found this bug just yesterday, because we changed all of our research systems to Pascal. I will profile my software in the next few days on a 980 in comparison to a 1080. I'm not yet sure if I should report a bug in the NVIDIA developer zone.
It is a bug in NVIDIA's driver on Windows systems, which occurs with the Pascal architecture.
I have known this for a few days, but could not write it here because I was on vacation without an internet connection.
For details, see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the bug. It should be corrected with CUDA 9. He also says that it should be communicated to Microsoft to help the cause, but I have not found a suitable Microsoft bug report page so far.

How can I improve the performance of my OpenMP code?

I am currently trying to improve parallel performance in my code, and I am still new to OpenMP. I have to iterate over a large container, in each iteration reading from multiple entries and writing a result to a single entry. Below is a very minimal code example of what I am trying to do.
data is a pointer to an array where a lot of datapoints are stored. Before the parallel region I create an array newData, so I can use data as read-only and newData as write-only; afterwards I throw the old data away and use newData for further calculations.
To my understanding, data and newData are shared between threads and everything declared inside the parallel region is private.
Can reading from data by multiple threads cause performance issues?
I am using #pragma omp critical for assigning a new value to an element of newData to avoid race conditions. Is this necessary, since I access every element of newData only once and never by multiple threads?
Also, I am not sure about scheduling. Do I have to specify whether I want a static or dynamic schedule? Can I use nowait, since all threads are independent of each other?
array *newData = new array;
omp_set_num_threads(threads);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < range; i++)
    {
        double middle   = (*data)[i];
        double previous = (*data)[i - 1];
        double next     = (*data)[i + 1];
        double new_value = (previous + middle + next) / 3.0;
        #pragma omp critical(assignment)
        (*newData)[i] = new_value;
    }
}
delete data;
data = newData;
I am aware that in the first and last iteration, previous and next cannot be read from data; in the real code this is taken care of, but for this minimal example you get the idea of reading multiple times from data.
First of all, get rid of all unnecessary dependencies. #pragma omp critical(assignment) is not necessary because each index of (*newData) is only written to once per loop, so there's no race condition.
Your code could now look like this:
#pragma omp parallel for
for (int i = 0; i < range; i++)
    (*newData)[i] = ((*data)[i - 1] + (*data)[i] + (*data)[i + 1]) / 3.0;
Now we're looking for bottlenecks. The list of potential candidates I came up with is this:
Slow division
Cache thrashing
ILP (Instruction level parallelism)
Memory bandwidth limitations
Hidden dependencies
So let's analyze them further.
Slow division:
It takes some CPUs forever to compute a double/double division. To know the latency and throughput of your CPU, you have to look at its specs. Maybe replacing /3.0 with *0.3333... might help, but maybe your compiler does this already. Using extended instruction sets (like SSE/AVX), you might schedule several divisions/multiplications at once.
Cache thrashing:
Because your CPU has to load/store one cache line at a time, there can be conflicts. Imagine thread 1 tries to write to (*newData)[1] and thread 2 to (*newData)[2], and they are on the same cache line. Now one of them has to wait for the other. You could resolve this with #pragma omp parallel for schedule(static, 64).
ILP:
CPUs can schedule multiple operations into a pipeline if the operations are independent. For this to happen you have to unroll your loop. That could look like this:
assert(range % 4 == 0);
#pragma omp parallel for
for (int i = 0; i < range / 4; i++) {
    (*newData)[i * 4 + 0] = ((*data)[i * 4 - 1] + (*data)[i * 4 + 0] + (*data)[i * 4 + 1]) / 3.0;
    (*newData)[i * 4 + 1] = ((*data)[i * 4 + 0] + (*data)[i * 4 + 1] + (*data)[i * 4 + 2]) / 3.0;
    (*newData)[i * 4 + 2] = ((*data)[i * 4 + 1] + (*data)[i * 4 + 2] + (*data)[i * 4 + 3]) / 3.0;
    (*newData)[i * 4 + 3] = ((*data)[i * 4 + 2] + (*data)[i * 4 + 3] + (*data)[i * 4 + 4]) / 3.0;
}
Memory bandwidth limitations:
For your very simple loop, think about this: how much memory do you have to load, and how long will your CPU be busy processing it? You're loading about one cache line and computing some dereferences, some pointer additions, two additions, and one division. Which limit you hit depends on your CPU specs.
Now consider cache locality. Can you modify your code to make better use of the cache? If one thread gets i=3 in one loop iteration and i=7 in the next, you have to reload three (*data) values. But if you go from i=3 to i=4, you might not have to load anything, because (*data)[i+1] was in the cache line loaded previously. You save some RAM bandwidth. To make use of this, unroll the loop. Also, using float instead of double increases this chance.
Hidden dependencies:
Now this part I personally find very tricky. Sometimes your compiler isn't sure it can reuse some data, because it doesn't know the data hasn't changed. Using const helps the compiler. But sometimes you need restrict to give the compiler the right hint. I don't understand this well enough to explain it.
So here is what I would try:
const double ONETHIRD = 1.0 / 3.0;
assert(range % 4 == 0);
#pragma omp parallel for schedule(static, 1024)
for (int i = 0; i < range / 4; i++) {
    (*newData)[i * 4 + 0] = ((*data)[i * 4 - 1] + (*data)[i * 4 + 0] + (*data)[i * 4 + 1]) * ONETHIRD;
    (*newData)[i * 4 + 1] = ((*data)[i * 4 + 0] + (*data)[i * 4 + 1] + (*data)[i * 4 + 2]) * ONETHIRD;
    (*newData)[i * 4 + 2] = ((*data)[i * 4 + 1] + (*data)[i * 4 + 2] + (*data)[i * 4 + 3]) * ONETHIRD;
    (*newData)[i * 4 + 3] = ((*data)[i * 4 + 2] + (*data)[i * 4 + 3] + (*data)[i * 4 + 4]) * ONETHIRD;
}
And then benchmark. Benchmark some more, and benchmark some more. Only benchmarks will show you which tricks help.
PS: One more thing to consider: if you see your program hitting the memory bandwidth hard, you could consider changing the algorithm. Maybe fuse two smoothing passes into one, going from
b[i] := (a[i-1] + a[i] + a[i+1]) / 3.0
applied twice, to the single pass
c[i] := (b[i-1] + b[i] + b[i+1]) / 3.0 = (a[i-2] + 2.0*a[i-1] + 3.0*a[i] + 2.0*a[i+1] + a[i+2]) / 9.0.
I think you will find out the reason for this yourself.
Have fun optimizing ;-)
Reading an array by multiple threads usually does no harm.
You only need a critical section if multiple threads work on the exact same piece of data; here each thread accesses a different part of the array, so you don't need it. Critical sections are very bad for performance, so only use them if absolutely necessary. Often they can be replaced by atomic operations:
openMP, atomic vs critical?
Like a critical section, they don't make sense if each thread accesses different data.
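As a minimal sketch of the difference, here is a hypothetical sum reduction (not the question's loop) where threads really do touch the same data:

double sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < range; i++) {
    #pragma omp atomic   // hardware atomic update; much cheaper than a critical section
    sum += (*data)[i];
}

(In this particular case a reduction(+:sum) clause would be better still.)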
For the scheduler, it's best to test each of them and measure the performance, as predictions about performance are often wrong. Also try different chunk sizes.
Some other things that might help:
Measuring performance is often interfered with by other tasks on your PC, so take multiple measurements and take their minimum (except if the input is different each time; then take the average and do more measurements).
Do you really need double precision? Floats are a lot faster.
Edit: nowait is for multiple independent for loops: https://msdn.microsoft.com/en-us/library/ek5st0e3.aspx
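A minimal sketch of what that looks like (hypothetical arrays a and b; the two loops must be truly independent):

#pragma omp parallel
{
    #pragma omp for nowait   // threads skip the implicit barrier here...
    for (int i = 0; i < n; i++)
        a[i] = i;
    #pragma omp for          // ...and proceed straight into this loop
    for (int i = 0; i < n; i++)
        b[i] = 2 * i;
}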
I assume you are trying to do some kind of convolution or median blur on a 1D array. The short answer is: stick to the default schedule strategy, and get rid of critical altogether.
As far as I can tell, you are quite a newbie to parallelism, and it's a little bit confusing to deal with OpenMP directives like nowait/private/reduction/critical/atomic/single, etc. I think what you need is a well-written textbook to clarify the various concepts. Once you have sound knowledge, an hour of learning OpenMP could be enough to deal with most daily programming.

Will matrix multiplication using for loops decrease performance?

Currently I'm working on a program that uses matrices. I came up with this nested loop to multiply two matrices:
// The matrices are 1-dimensional arrays
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 4; k++)
            result[i * 4 + j] += M1[i * 4 + k] * M2[k * 4 + j];
The loop works. My question is: will this loop be slower compared to writing it all out manually like this:
result[0] = M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
result[1] = M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
result[2] = ... etc.
Because in the nested loop, the array positions are calculated and in the second method, they do not.
Thanks.
As with so many things, "it depends", but in this instance I would expect the second, expanded form to perform just about the same. Any modern compiler will unroll appropriate loops for you and take care of it.
Two points perhaps worth making:
The second approach is uglier, more prone to errors, and tedious to write/maintain.
This is a nice example of 'premature optimization' (AKA the root of all evil). Do you know if this section is a bottleneck? Is this really the most intensive part of the code? By optimizing so early we incur everything in point #1 for what amounts to a hunch if we haven't benchmarked our code.
Your compiler might already do this; take a look at loop unrolling.
Let the compiler do the guessing and the heavy work, stick to the clean code, and as always, measure your performance.
I don't think the loop will be slower, since you are accessing the memory of the M1 and M2 arrays in the same way in both instances. If you want to make the "manual" version faster then use scalar replacement and do the computation in registers, e.g.
double M1_0 = M1[0];
double M2_0 = M2[0];
result[0] = M1_0*M2_0 + ...
but you can use scalar replacement within the loop as well. You can do it if you do blocking and loop unrolling (in fact, your triple loop looks like a blocked version of MMM); see the sketch below.
What you are trying to do is to speed up the program by improving locality, i.e. better use of the memory hierarchy.
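A sketch of scalar replacement applied inside the question's loop (assuming the same 4x4 layout): the running sum lives in a register instead of re-reading and re-writing result[] on every k iteration.

for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++) {
        double sum = 0.0;   // accumulator stays in a register
        for (int k = 0; k < 4; k++)
            sum += M1[i * 4 + k] * M2[k * 4 + j];
        result[i * 4 + j] = sum;   // one store per output element
    }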
Assuming that you are running code on Intel processors or compatible (AMD), you may actually want to switch to assembly language for heavy matrix computations. Luckily, there is the Intel IPP library, which does the actual work for you, using advanced processor features and selecting what is expected to be the fastest algorithm for your processor.
IPP includes all the matrix computations you could possibly need. The only problem you may encounter is the order in which you created your matrices: you may have to reorganize that order to make it easier to use the IPP functions you'd like to use.
Note that in regard to your two code examples, the second one will be faster because you avoid the += operator, which is a read/modify/write cycle that is generally slow (not only that, it requires the result matrix to be all zeroes to start with, whereas the second example does not require clearing the output first), although your matrices are likely to fit in the cache. Processors are optimized to read input data in sequence (a[0], a[1], a[2], a[3], ...) and also to write that data back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplications cannot be done fully in sequence. But if you keep that in mind while optimizing, you'll achieve better results; changing the order in which your matrices are saved in memory could be one of them.
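A sketch of that last idea, under one assumption of mine: M2 is stored transposed up front (here called M2T), so the inner loop walks both inputs sequentially.

// Hypothetical layout: M2T[j * 4 + k] == M2[k * 4 + j], prepared once.
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++) {
        double sum = 0.0;
        for (int k = 0; k < 4; k++)
            sum += M1[i * 4 + k] * M2T[j * 4 + k];   // both reads are sequential
        result[i * 4 + j] = sum;
    }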

C++ Adding 2 arrays together quickly

Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 to 100, what is the fastest way in C++ to add those two arrays so that each cell in canvas equals itself plus the corresponding cell value in addon?
IE, I want to achieve something like:
canvas += addon;
So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] = 5.
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!): what would be the fastest way of checking whether any of the values in canvas exceed 100? Loops are slow!
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i *cp = (__m128i *)&canvas[0][0];
    const __m128i *ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);
        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc); // SSE4 *
        _mm_storeu_si128(cp, vc);
        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp); // SSE4 *
    return cmp == 0;
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
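For instance, a sketch of an SSE2-only substitute for _mm_max_epi32 (one possible blend; the helper name is mine):

// Select a's lanes where a > b, b's lanes elsewhere.
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b);            // a > b ? all-ones : zero
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}

Similarly, the _mm_testz_si128 test can be emulated on SSE2 with _mm_movemask_epi8(vcmp) != 0.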
You can't do anything faster than loops in just C++. You would need to use some platform specific vector instructions. That is, you would need to go down to the assembly language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of doing the low level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.
The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop, as sketched below. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize that out. The only way to know how much of an effect there is, if there is one, is to test.)
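A minimal sketch of that flattening (the fused limit check is my addition, addressing the extra question):

int *c = &canvas[0][0];            // treat the 10x10 arrays as flat [100]
const int *a = &addon[0][0];
bool over = false;
for (int i = 0; i < 100; i++) {
    c[i] += a[i];
    over = over || (c[i] > 100);   // fold the "exceeds 100?" check into the same pass
}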
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
You would need to get into lower-level processing to speed that up. There are additional instructions on many modern CPUs for such processing that you might be able to use. You might be able to run something like this on a GPU using something like CUDA. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way; without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into billions of operations on individual numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.
Two parts: first, consider your two-dimensional array [10][10] as a single array [100]; the layout rules of C++ allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. Microsoft, for example, supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from int to uint8_t. Then you could add four elements at once using a single uint32_t, without worrying about overflow: since 100 + 100 = 200 still fits in one byte, no carry can spill into the neighboring byte.
That is ...
#include <stdint.h>

uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

uint32_t *pArr1 = (uint32_t *)&array1[0][0];
uint32_t *pArr2 = (uint32_t *)&array2[0][0];
uint32_t *pDest = (uint32_t *)&dest[0][0];
int i;

for (i = 0; i < sizeof(dest) / sizeof(uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];  // adds 4 packed bytes per iteration
}
It may not be the most elegant, but it could help keep you from writing architecture-specific code. Additionally, if you do this, I would strongly recommend you comment what you are doing and why.
You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA capable hardware, and CUDA takes a bit of effort to get setup in your development environment, so it would depend how important this really is!
Good luck!