Verfiy the number of times a cuda kernel is called - c++

Say you have a cuda kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However now say for debugging purposes when you're calling the kernel millions of times, you want to verify that your actually calling the Kernel that many times.
What I did was pass a pointer to the kernel and ++'d the pointer every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
kernelCount[0]++; // Add to the pointer
}
However when I copied that pointer back to the main function I get "2".
At first it baffeled me, then after 5 minutes of coffee and pacing back and forth I realized this probably makes sense because the cuda kernel is running 1024 instances of itself at the same time, which means that the kernels overwrite the "kernelCount[0]" instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id > kernelCount[0]){
kernelCount[0] = id;
}
}
Genius!! This was guaranteed to work I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?

Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
atomicAdd(kernelCount, 1); // Add to the pointer
}
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases it's local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases it's local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id + 1 > kernelCount[0]){
kernelCount[0] = id + 1;
}
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
#%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;

It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.

Related

Strange behaviour in multithread analysis

for a university project we are implementing an algorithm capable of bruteforcing on an AES key that we assume is partially known.
We have implemented several versions including one that exploits the multithreading mechanism in C++.
The implementation is done by allocating a variable number of threads, to be passed as input at launch, and dividing the key space equally for each thread that will cycle through the respective range attempting each key. De facto the implementation works, as it succeeds in finding the key for any combination #bitsToHack/#threads but returns strange timing results.
//Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td [num_of_threads];
int rc;
//Space division
uintmax_t index = pow (BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index/num_of_threads;
if(sem_init(&s, 1, 0)!=0){
printf("Error during semaphore initialization\n");
return -1;
}
for(int i = 0; i < num_of_threads; i++){
//Structure initialization
td[i].ciphertext = ciphertext;
td[i].hacked_key = hacked_key;
td[i].iv_aes = iv_aes;
td[i].key = key_aes;
td[i].num_bits_to_hack = num_bits_to_hack;
td[i].plaintext = plaintext;
td[i].starting_point = step*i;
td[i].step = step;
td[i].num_of_threads = num_of_threads;
if(DEBUG)
printf("Starting point for thread %d is: %lu, using step: %lu\n", i , td[i].starting_point, td[i].step);
rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
if (rc){
cout << "Error:unable to create thread," << rc << endl;
exit(-1);
}
}
sem_wait(&s);
for(int i = 0; i < num_of_threads; i++){
pthread_join(threads[i], NULL);
}
For the decryption_brute_force function (The body of each thread):
void* decryption_brute_force(void* data){
** Copy data on local thread memory
** Build the key to begin the search from starting point
** for each key from starting_point to starting_point + step
** Try decryption
** if obtained plaintext corresponds to the expected one
** Print results, wake up main thread and terminate
** else
** increment the key and continue
}
To conclude the project we intended to conduct a study of the optimal number of threads expecting an increase in performance as the number of threads increased up to a threshold, after which the system would no longer benefit from the increase in threads assigned to it.
At the end of the analysis (a simulation lasting about 9 hours), the results obtained were as follows in figure.
Click here to see the plot.
We cannot understand why 8 threads performs better than 16. Could it be due to the CPU architecture? Could it be able to schedule 32 and 8 threads better than 16?
From comments, I think it could be the linear-search pattern in each thread yields to different results for different number of threads. Because when you double the threads, the actual linear point to find in a thread may shift to a further point. But once you double again, it can not go much further due to too many threads. Because you said you are using only same encrypted data always. Did you try different inputs?
this variable is integer (so it may not be exact distribution)
^
8 threads & step=7 (56 work total)
index-16 (0-based)
v
01234567 89abcdef 01234567 89abcdef
| | |. | ...
500 seconds as its the first loop iteration
16 threads & step=3 (56 work total)
index-16 again, but at second-iteration now
v
012 345 678 9ab cde f01 234 567 8
| | | | | | . | | | ...
1000 seconds as it finds only after second iteration in the thread
Another example with 2 threads and 3 threads:
x to found at 51-th element of 100-element-work:
2 threads
| |x(1st iteration) |
3 threads
| |........x | |
5x slower than 2 threads

chrono give different measures at same function

i am trying to measure the execution time.
i'm on windows 10 and use gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
this is my code to measure time about bp_w->insert()
the function insert work internally like follow (just pseudo code)
insert(){
_load_node(node);
// do something //
_save_node(node, addr);
}
_save_node(n){
ofstream file(name);
file.write(n);
file.close();
}
_load_node(n, addr){
ifstream file(name);
file.read_from(n, addr);
file.close();
}
the actual results is,
read is number of _load_node executions.
write is number of _save_node executions.
time is nano secs.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
i don't have any idea why this result come and want to know.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any measurement will fluctuate a lot. The sane way is to average it out over multiple measurement. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the results by a thousand.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to low resolution and its non-monotonic nature.
Use steady_clock instead. it is guaranteed to be monotonic and have a low enough resolution to be useful.

Open MP bottleneck issue

I was trying to observe a basic openMP based parallelism with the following code,
#include<stdio.h>
#include<omp.h>
#include<stdlib.h>
#include <time.h>
int main(){
long i;
long x[] = {0,0,0,0};
omp_set_num_threads(4);
clock_t time=clock();
#pragma omp parallel for
for(i=0;i<100000000;i++){
x[omp_get_thread_num()]++;
}
double time_taken = (double)(clock() - time) / CLOCKS_PER_SEC;
printf("%ld %ld %ld %ld %lf\n",x[0],x[1],x[2],x[3],time_taken);
}
Now, I am using a quad core i5 processor. I have checked 4 different values of the threads. The following results are found,
Set: omp_set_num_threads(1);
Out: 100000000 0 0 0 0.203921
Set: omp_set_num_threads(2);
Out: 50000000 50000000 0 0 0.826322
Set: omp_set_num_threads(3);
Out: 33333334 33333333 33333333 0 1.448936
Set: omp_set_num_threads(4);
Out: 25000000 25000000 25000000 25000000 1.919655
The x array values are accurate. But the time is surprisingly increasing in the increased number of threads. I can not get any explanation/justification behind this phenomenon. Is it somehow, omp_get_thread_num() function that is atomic in nature ? Or something else that I am missing out ?
Compiling as, gcc -o test test.c -fopenmp
UPDATE
So, as per the suggestion in the accepted answer, I have modified the code as follows,
#include<stdio.h>
#include<omp.h>
#include<stdlib.h>
int main(){
long i, t_id, fact=1096;
long x[fact*4];
x[0]=x[fact]=x[2*fact]=x[3*fact]=0;
omp_set_num_threads(4);
double time = omp_get_wtime();
#pragma omp parallel for private(t_id)
for(i=0;i<100000000;i++){
t_id = omp_get_thread_num();
x[t_id*fact]++;
}
double time_taken = omp_get_wtime() - time;
printf("%ld %ld %ld %ld %lf\n",x[0],x[fact],x[2*fact],x[3*fact],time_taken);
}
Now, the results are understandable,
Set: omp_set_num_threads(1)
Out: 100000000 0 0 0 0.250205
Set: omp_set_num_threads(2)
Out: 50000000 50000000 0 0 0.154980
Set: omp_set_num_threads(3)
Out: 33333334 33333333 33333333 0 0.078874
Set: omp_set_num_threads(4)
Out: 25000000 25000000 25000000 25000000 0.061155
Therefore, it was about the cache line size as explained in the accepted answer. Have a look there to get the answer.
Note that the 4 integers that you are operating on lie very closely together, probably on one cache line. Since cache lines are loaded into the CPU cache in one go, each thread needs to ensure that it has the latest version of that cache line. Since all threads want to modify (and not just read) that one cache line, they are constantly invalidating one another's copy. Welcome to false sharing!
To solve this problem, ensure that the integers are (physically) far enough apart from one another, e.g., by allocating structures that fill (at least) one full cache line for each thread to work with.
When executing your sample program using 4 threads on one of my machines, I got the following result:
25000000 25000000 25000000 25000000 5.049694
When modifying the program, such that the array has 4096 elements, and using the elements 0, 1024, 2048 and 3072 (which ensures enough distance), the program runs a lot faster:
25000000 25000000 25000000 25000000 1.617231
Note that although you are counting the processor time used by the whole process, without false sharing, the time should not increase significantly, but rather be more or less constant (there is some additional locking involved, but it should not usually be on the order of a 10x increase). In fact, the performance boost shown above also translates into wall-clock time (~1.25 seconds to ~500ms).
The reason for your observation, as noted by gha.st, are false sharing and the properties of the clock function.
For this reason x[omp_get_thread_num()], is an anti-pattern. Sure, you can leverage your new knowledge by adding a stride in the memory. But this also encodes hardware-specific properties (i.e. cache line size) into your data structures. This can lead to nasty code that is difficult to understand and still has bad performance portability.
The idiomatic solution is to use either of the following:
If you are only interested in an aggregate, use a reduction clause, i.e.:
long x = 0;
#pragma omp parallel for reduction(+:x)
for(i=0;i<100000000;i++){
x++;
}
// total sum is now in x
If you need individual values within the thread, just use a private variable, preferably implicitly by scope. Or if you need particular initialization from outside the construct use firstprivate.
#pragma omp parallel
{
long local_x = 0; // implicitly private by scope!
#pragma omp for
for(i=0;i<100000000;i++) {
local_x++;
}
// can now do something with the the sum of the current thread.
}
And if you need the per-thread results outside, you can just use the second form and write the result once:
#pragma omp parallel
{
long local_x = 0; // implicitly private by scope!
#pragma omp for
for(i=0;i<100000000;i++) {
local_x++;
}
x[omp_get_thread_num()] = local_x;
}
That's not to say that you never need to design a datastructure with false-sharing in mind. But it's not as common as you might think.

No speedup for vector sums with threading

I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to the MKL or ACML depending on the platform. A lot of these matrix calculations operate on different independent matrices and hence I use std::thread's to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy Blas routine. It seems that if two threads are using this routine in parallel each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running parallel also doubles the amount of time each threads needs, so no speed-up is gained. However, using the Intel C++ Compiler this problems vanishes: with increasing number of threads the time a single thread needs is constant.
However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
const std::size_t N { 936 };
std::vector <std::size_t> times(num_threads, 0);
auto threadfunction = [&](std::size_t tid)
{
const std::size_t dim { N * N };
double* pA = new double[dim];
double* pB = new double[dim];
for (std::size_t i {0}; i < N; ++i){
pA[i] = i;
pB[i] = 2*i;
}
boost::posix_time::ptime now1 =
boost::posix_time::microsec_clock::universal_time();
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
boost::posix_time::ptime now2 =
boost::posix_time::microsec_clock::universal_time();
boost::posix_time::time_duration dur = now2 - now1;
times[tid] += dur.total_milliseconds();
delete[] pA;
delete[] pB;
};
std::vector <std::thread> mythreads;
// start threads
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads.emplace_back(threadfunction, n);
}
// wait for threads to finish
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads[n].join();
std::cout << " Thread " << n+1 << " of " << num_threads
<< " took " << times[n]<< "msec" << std::endl;
}
}
}
void simplesum(double* a, double* b, std::size_t dim){
for(std::size_t i{0}; i < dim; ++i)
{*(++a) += *(++b);}
}
The outout with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The outout with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with the icc the time needed for one thread perform the computations is constant (as I would have expected; my CPU has 4 physical cores) and with the gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the above stated gcc results.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple cpus actually have to share the memory controller among several cores.
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that inlining happens is simple: Using icc, I have disable multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it and therefore no matter what else causes ICC to perform better than g++, it will not happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds them 1000 times (or even performs a = 1000 * b)
Stores two doubles
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
double x = *a, y = *b;
for (std::size_t n{0}; n < 1000; ++n)
{
x += y;
}
*a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which improves performance by 200x in the single threaded and 15x in the 8-threaded cases. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked into its favor.
Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment is 4.9.0 on a x64 Arch linux.
Ok, I came to the conclusion that the main problem is that the processor acts on different parts of the memory in parallel and hence I assume that one has to deal with lots of cache misses which slows the process further down. Putting the actual sum function in a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache missses, but of course does not yield optimal speed-up. Anyway, because now the other threads are blocked the simplesum method might as well use all available threads for the sum
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
omp_set_num_threads(numberofthreads);
#pragma omp parallel
{
#pragma omp for
for(std::size_t i = 0; i < dim; ++i)
{
a[i]+=b[i];
}
}
}
In this case all the threads work on the same chunk on memory: it should be in the processor cache and if the processor needs to load some other parts of the memory into its cache the other threads benefit from this all well (depending whether this is L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on some loop switching tricks which I cannot do in my actual code.

Array Reverse - XOR slower than switching with temporary object

I was reading a question regarding the fastest way to reverse an array (which ended up being less than thrilling), and I came across an interesting comment located at the link here:
https://stackoverflow.com/a/1129028/857994
The solution referenced shows these two possibilities:
//Possibility #1
void reverse(char word[])
{
int len=strlen(word);
char temp;
for (int i=0;i<len/2;i++)
{
temp=word[i];
word[i]=word[len-i-1];
word[len-i-1]=temp;
}
}
//Possibility #2
void reverse(char word[])
{
int len=strlen(word);
for (int i=0;i<len/2;i++)
{
word[i]^=word[len-i-1];
word[len-i-1]^=word[i];
word[i]^=word[len-i-1];
}
}
and the comment states: "Using XOR will be far slower than swapping using a temp object."
Nobody disputed this. So, my questions are:
Is this true?
Why is it true?
Would it still be true if this was an array of a non-built-in-type?
The xor loop contains 2 memory reads and 1 memory write per line, for a total of 6 reads and 3 writes for each loop iteration. Furthermore, there is a strong dependency between the first line the writes to word[i] and the next line that reads from word[i]. This will prevent pipelining, or if the two lines execute in parallel, the second line's read from word[i] will stall until the first line's write is complete. There is another such dependency between the 2nd and 3rd lines.
In the temp var loop, the temp var will almost certainly be stored in a CPU register, not in main memory. So the total memory I/O count for the temp var loop is 2 reads and 2 writes. There are loose data flow dependencies between the statements, but they are read-before-write which can be pipelined. The data flow dependencies in the xor example are read-after-write, which are much harder to do without stalling the pipeline.
6 reads + 3 writes compared to 2 reads + 2 writes. 2 + 2 has a distinct advantage.