Parallel control of USRPs in MATLAB / C++

I would like to thank everyone in advance for reading this. I have a very specific issue: I need to control three USRPs simultaneously from MATLAB. Here is my serial solution:
frameLength = 180;
% transmit
R = zeros(frameLength, 8000);
i = 1; j = 1; k = 1; nStop = 8000;
while (j < nStop) % three times the length of the frame
    step(hTx1, data1(:,i)); % Tx1 transmits data in interval i
    step(hTx2, data2(:,i)); % Tx2 transmits data in interval i
    [Y, LEN] = step(hRx);   % receiver receives data from Tx1 and Tx2 (but shifted)
    data = Y;               % just reorganized
    if mod(i, nPacket) == 0 % end of column (packet); start a new packet
        i = 0;
    end
    if LEN == frameLength   % just reorganizing
        R(:,k) = data;
        k = k + 1;
        if k == nStop
            break; % end
        end
    end
    i = i + 1;
    j = j + 1;
end
This solution has one problem: it is not fully synchronized, because the step functions are executed serially, so there is a small delay between the signals from Tx1 and Tx2 at the receiver.
If I try this with parfor, assuming matlabpool invokes 4 workers (cores), it gives me an error right on the first "step" call, because multiple workers try to execute the same function and collide. step is the MATLAB routine used to access a Universal Software Radio Peripheral (USRP). It is a little complicated: when one core is already executing that command for a given USRP, that USRP is busy, and any other call of the command on it causes an error.
Unfortunately, there is no scheduling mechanism for parallel loops that would assign the individual "step" commands to specific cores.
My question is: is there any way to parallelize at least the three step commands while preventing the cores from colliding? If only those three step commands run in parallel, the rest can be done serially; it doesn't matter.
In the worst case it could be done by invoking three MATLAB instances, where every instance controls one USRP, with some external routine before the step command (an x-bit counter in C, for instance) to synchronize the tasks.
I've already tried to use this semaphore routine to create a barrier where every core stops and waits before the step commands: http://www.mathworks.com/matlabcentral/fileexchange/45504-semaphore-posix-and-windows
This example is shown here:
function init()
(1)  exitThreads = false; % used to exit the func1, func2, func3 threads
(2)  cntMutexKey = 5; % mutex protecting doneCnt
(3)  doneCnt = 0; % func1-3 increment this when they finish
(4)  barrierCnt = 0; % global
(5)  barrierKey = 7; % global
(6)  paralellTasksDoneKey = 8; % global, semaphore telling the main loop when func1-3 are done
(7)  semaphore('create', cntMutexKey, 1);
(8)  semaphore('create', barrierKey, 4); % one count per participant (func1-3 plus the main loop); we want it initialized to 0, so drain it below
(9)  semaphore('wait', barrierKey); % now it has 3
(10) semaphore('wait', barrierKey); % now it has 2
(11) semaphore('wait', barrierKey); % now it has 1
(12) semaphore('wait', barrierKey); % now it has 0
(13) semaphore('create', paralellTasksDoneKey, 1);
(14) semaphore('wait', paralellTasksDoneKey); % set it to 0
(15) funList = {@func1, @func2, @func3};
(16) matlabpool
(17) parfor i=1:length(funList) % start 3 threads
         funList{i}();
(18) end
(jump to) mycycle(); % now run your code
     exitThreads = true; % tell func1-3 to exit
end
function func1()
global exitThreads
while(~exitThreads)
    barrier();
    step(hTx1, data1(:,l));
    done();
end
end
function func2()
global exitThreads
while(~exitThreads)
    barrier();
    step(hTx2, data2(:,l));
    done();
end
end
function func3()
global exitThreads Y LEN
while(~exitThreads)
    barrier();
    [Y, LEN] = step(hRx); % need [Y, LEN] global or something, and run "for j=1:8000" sequentially, not in parallel
    done();
end
end
(25) function barrier()
(26) semaphore('wait', cntMutexKey); % mutex, initialized to 1; once all four participants have incremented barrierCnt, the IF below fires
(27) barrierCnt = barrierCnt + 1; % changed from barrierCnt += 1
(28) if(barrierCnt == 4) % we now know that func1, func2, func3 and the main loop are all at the barrier
(29)     barrierCnt = 0; % reset count
(30)     semaphore('post', cntMutexKey);
(31)     semaphore('post', barrierKey); % increment barrier count, so a func will run
(32)     semaphore('post', barrierKey); % increment barrier count, so a func will run
(33)     semaphore('post', barrierKey); % increment barrier count, so a func will run
     else
(34)     semaphore('post', cntMutexKey);
(get stuck here) semaphore('wait', barrierKey); % wait for the other threads (the barrier)
     end
end
function done()
semaphore('wait', cntMutexKey); % cntMutexKey is the mutex for doneCnt (see init)
doneCnt = doneCnt + 1; % changed from doneCnt += 1
if(doneCnt == 3)
    semaphore('post', paralellTasksDoneKey);
    doneCnt = 0; % reset counter
end
semaphore('post', cntMutexKey);
end
function mycycle()
(19) global paralellTasksDoneKey Y LEN data
(21) for j=1:8000 % three times the sent and received frame with nPackets;
(22)     i = 1; % the example handles this loop sequentially
(23)     l = 1; % more complex to allow this in parallel, but it is not necessary
(24)     k = 1;
(jump to) barrier(); % want the loop to stop here & let func1, func2, func3 do their work
     semaphore('wait', paralellTasksDoneKey); % wait for func1, func2, func3 to finish
     data = Y;
     if mod(i, nPacket) == 0 % end of frame
         i = 0;
     end
     if LEN == frameLength
         R(:,k) = data;
         k = k + 1;
     end
     i = i + 1;
     l = l + 1;
end
end
*Note: the numbers and jumps in parentheses indicate the flow of the program, step by step, taken from the debugger. In the end the program gets stuck at the line marked "(get stuck here)".
Alternatively, it could perhaps be done by using the OpenMP library in C to run those commands in parallel sections, but I have no experience with that, and I am not such a skilled programmer. See http://bisqwit.iki.fi/story/howto/openmp/#Sections
Sorry for the somewhat large post, but I wanted to show you my solutions (not fully mine), because they may be helpful for anyone who reads this and is more skilled. I will be thankful for any kind of help or advice. Have a nice day, all of you.

I would suggest using SPMD rather than PARFOR for this sort of problem. Inside SPMD, you can use labBarrier to synchronise the workers.
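For example, here is a minimal sketch of that approach, assuming a pool of 3 workers and that the radio objects and data (hTx1, hTx2, hRx, data1, data2 as in the question) are available on the workers; the indexing is simplified to a single loop counter. labindex picks which radio each worker drives, and labBarrier lines the three step calls up at the top of every iteration:

matlabpool open 3 % parpool(3) in newer releases
spmd
    for j = 1:nStop
        labBarrier; % all three workers wait here, then proceed together
        if labindex == 1
            step(hTx1, data1(:,j));  % worker 1 drives Tx1
        elseif labindex == 2
            step(hTx2, data2(:,j));  % worker 2 drives Tx2
        else
            [Y, LEN] = step(hRx);    % worker 3 reads the receiver
        end
    end
end
% after the spmd block, Y and LEN are Composites on the client;
% Y{3} and LEN{3} retrieve worker 3's values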


How do I use a for loop to flash a single LED on and off? (C++) (Mbed Studio) (Nucleo Board)

I am a complete coding noob, but I am trying to get all the LEDs to flash on and off 5 times while specifically using a for loop (it has to be a for loop).
The LEDs in question are attached to a bus (this also has to be the case) with the integer assignment of 76.
EDIT: When I try a simple for loop with a counter of 5 and then turn the LEDs on and off in the statement, it only does it once. It may be simpler to tie my LED flashes to the count in the for loop, if that is possible.
My thinking so far is either to design a for loop that repeats the same two numbers 5 times (76 and 0) and assign the bus to the count in the statement (however, I am struggling to get my head around how to do this only 5 times; my mind can only perceive creating a nested loop endlessly repeating), or to somehow nest a for loop with the operation I want on the inner loop counting off of the outer loop.
Can anyone tell me if I'm on the right track, and if so, how to either run my first idea only 5 times or assign my bus actions to my outer loop for the second?
My code so far is below, but I have only managed to get the LEDs to turn on.
PortOut traffic(PortC, 0b0000000001001100);

// The default flash rate is once per second for all tasks
int main()
{
    int i;
    int b;
    traffic = 0;
    // 1. Flash ALL the LEDs 5 times using a for loop; when finished, the LEDs must be OFF
    for (i = 0; i < 5; i = i + 1)
    {
        printf("i%d\n", i);
        for (b = i; b < 5;)
        {
            traffic = 76;
            wait_us(1000000);
            traffic = 0;
        }
    }
}
Many thanks in advance,
Joe.
Tried nesting a for loop to repeat the same two integers 5 times in order to assign the bus to the count;
only managed to endlessly repeat the for loop.
Tried nesting a for loop to count to 5 on the outer loop and flash the LEDs on the inner loop;
only managed to switch the LEDs on once.
Your outer loop counts the number of pulses.
The contents of the loop determine how long the LEDs are on and off during each pulse:
for (int counter = 0; counter < 5; ++counter)
{
    // Turn ON the LEDs
    traffic = 76;
    // Wait while the LEDs are on.
    wait_us(1000000);
    // Turn OFF the LEDs
    traffic = 0;
    // Wait while the LEDs are OFF.
    wait_us(1000000);
} // End of a pulse
The issue is that you need a delay after you turn the LEDs on and another delay after you turn them off. You can adjust the two delay amounts independently; the on and off times don't need to be equal. With much shorter periods, adjusting the on/off ratio would also control the apparent brightness.
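For completeness, here is a minimal sketch of the whole program under the question's own setup (same PortC mask, Mbed's PortOut and wait_us); the loop exits with the bus at 0, so the LEDs finish OFF:

#include "mbed.h"

PortOut traffic(PortC, 0b0000000001001100);

int main()
{
    // Flash ALL the LEDs 5 times; one loop iteration == one pulse
    for (int counter = 0; counter < 5; ++counter)
    {
        traffic = 76;     // LEDs on
        wait_us(1000000); // 1 s on
        traffic = 0;      // LEDs off
        wait_us(1000000); // 1 s off
    }
    // traffic is 0 here, so the LEDs end up OFF
}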

What is the optimal multithreading scenario for processing a long file's lines?

I have a big file, and I want to read and also process all of its lines (the even lines) with multiple threads.
One suggestion is to read the whole file, break it into multiple files (the same count as there are threads), and then let every thread process a specific file. As this idea reads the whole file, writes it out again, and reads the multiple files back, it seems slow (3x the I/O), and I think there must be better scenarios.
I myself thought this could be a better scenario:
One thread will read the file and put the data in a global variable, and the other threads will read the data from that variable and process it. In more detail:
One thread will read the main file, running function func1, and put each even line in a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE, and the other threads will pop their data from the buffer and process it by running function func2. In code:
Global variables:
#include <fstream>
#include <string>
#include <vector>
using namespace std;

#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool stillReading = true; // set to false to end threads 2..N ("continue" is a C++ keyword, so the flag needs another name)
string file = "reads.fq";
Function func1 (thread 1):
void func1(){
    string ReadSeq;
    ifstream ifstr(file.c_str());
    for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
        getline(ifstr, ReadSeq);
        getline(ifstr, ReadSeq); // reading the even lines
        while (line1Buffer.size() == MAX_BUFFER_SIZE)
            ; // busy-wait while the buffer is full
        line1Buffer.push_back(ReadSeq);
    }
    stillReading = false;
    return;
}
And function func2 (the other threads):
void func2(){
    string ReadSeq;
    while (stillReading) {
        if (line1Buffer.size() > 0) {
            ReadSeq = line1Buffer.back(); // vector::pop_back() returns void, so read back() first
            line1Buffer.pop_back();
            // do the processing....
        }
    }
}
About the speed:
If the reading part is slower, the total time will be equal to the time of reading the file just once (and the buffer may contain just 1 line at a time, so only 1 other thread will be able to work alongside thread 1). If the processing part is slower, the total time will be equal to the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file and writing it into multiple files with 1 thread and then reading the files and processing with multiple threads...
So there are two questions:
1- How can I call the functions from threads such that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously we know: M + N == numberOfThreads is true.
Edit: the 3rd question is not right, as multiple threads can't help with reading a single file.
Thanks, all.
Another approach could be interleaved threads.
Reading is done by every thread, but only 1 at a time.
Because of the waiting in the very first iteration, the threads will become interleaved.
But this is only a scalable option if work() is the bottleneck (otherwise a non-parallel version would be faster).
Thread:
while (!end) {
    // should be fair!
    lock();
    read();
    unlock();
    work();
}
A basic example (you should probably add some error handling):
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
using namespace std;

// example values; tune for your workload
#define MAX_NUMBER_OF_LINES_PER_ITERATION 100
#define NUM_THREADS 4

void thread_exec(ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file, line)) {
            // we only want to process the even lines
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();
        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }
        // free old data
        data.clear();
        // repeat until EOF was reached
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
    // counter for checking whether a line is even
    int global_line_counter = 0;
    // open file
    ifstream ifstr(file.c_str());
    // mutex for synchronization; maybe a fair lock would be a better solution
    std::mutex mutex;
    // create the threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter)
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}
What is your bottleneck: the hard disk or the processing time?
If it's the hard disk, then you're probably not going to get any more performance out, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed, as it will increase disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads and the two scenarios mentioned in the post and the answers:
1) Reading with one thread and processing with the other threads. In this case #p-1 threads process, compared with only one thread reading. Assume the time for the full operation is jobTime and the time for processing with n threads is pTime(n). Then:
the worst case occurs when reading is much slower than processing, and jobTime = pTime(1) + readTime; the best case is when the processing is slower than reading, in which case jobTime is equal to pTime(#p-1) + readTime.
2) Reading and processing with all #p threads. In this scenario every thread does two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE, which is sequential; no two threads can read at the same time. The second step is processing the read data, which can be parallel. This way, in the worst case jobTime is pTime(1) + readTime as before (but*), but the best, optimized case is pTime(#p) + readTime, which is better than before.
*: in the 2nd approach's worst case, although reading is slower, you can find an optimized MAX_BUFFER_SIZE for which (even in the worst case) some reading by one thread overlaps with some processing by another thread. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1) + readTime and could converge to readTime.
First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push in front and pop out back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
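A minimal sketch of that queue (the class name and the closed flag are my own additions for illustration): the reader pushes at the front, the workers pop from the back, and two condition variables enforce both the size limit and the end-of-file handoff.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

class BoundedLineQueue {
    std::deque<std::string> queue_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    std::size_t maxSize_;
    bool closed_ = false;
public:
    explicit BoundedLineQueue(std::size_t maxSize) : maxSize_(maxSize) {}

    // called by the reader thread; blocks while the queue is full
    void push(std::string line) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [&] { return queue_.size() < maxSize_; });
        queue_.push_front(std::move(line));
        notEmpty_.notify_one();
    }

    // called by worker threads; returns false once closed and drained
    bool pop(std::string& line) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [&] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        line = std::move(queue_.back());
        queue_.pop_back();
        notFull_.notify_one();
        return true;
    }

    // called by the reader when the file is exhausted
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }
};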

Boost Thread_Group in a loop is very slow

I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through the lines of pixel data, each pixel, and then the multiple images. It's a weird project, but anyway, I bind each thread to a method on the same instance of the class this code is in, so "this" is used. The code runs through a population of about 20 images, binding a thread for each as it goes, and then, when the looping is done, join_all takes effect once the threads finish. Then it goes on to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
    for (int i = 0; i < 100; i++) {
        std::cout << "Index : " << index << " " << i << std::endl;
    }
}

int main() {
    boost::thread_group tGroup;
    for (int i = 0; i < 50; i++) {
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();
    int done;
    std::cin >> done;
    return 0;
}
This runs very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is: it takes about 4 seconds for one loop over sourceImageData (one line) to complete. I'm new to boost threading, so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple: don't start that many threads. Consider starting as many threads as you have logical CPU cores; starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but one per scan-line, for example), as in the sketch below.
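A hedged sketch of that idea, reusing the names from the question (sourceImageData, m_images and ClassX::ClassXFunction are assumed to exist as members, so this code would live inside a ClassX method): start one thread per core and give each thread a strided subset of scan-lines.

unsigned nThreads = boost::thread::hardware_concurrency();
if (nThreads == 0) nThreads = 4; // fall back if the core count is unknown
boost::thread_group tGroup;
for (unsigned t = 0; t < nThreads; ++t) {
    tGroup.create_thread([=] {
        // each thread owns every nThreads-th scan-line
        for (std::size_t line = t; line < sourceImageData.size(); line += nThreads)
            for (std::size_t pixel = 0; pixel < sourceImageData[line].size(); ++pixel)
                for (std::size_t im = 0; im < m_images.size(); ++im)
                    ClassXFunction(line, pixel, im);
    });
}
tGroup.join_all(); // one join for the whole image, not one per pixel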
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you are basically pausing execution of any new threads until ALL the threads that need to be synchronized (which in this case is all the active threads) are done running.
If the iterations of the innermost loop (the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.

CUDA, mutex and atomicCAS()

Recently I started developing on CUDA and ran into a problem with atomicCAS().
To do some manipulation of memory in device code, I have to create a mutex, so that only one thread can work with the memory in a critical section of the code.
The device code below runs on 1 block and several threads.
__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...
    do
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);
    // critical section
    // do some manipulations with objects in device memory
    *mutex = 0;
    ...
}
When the first thread executes
atomicCAS(mutex, 0, 1 + i);
*mutex becomes 1. After that, the first thread changes its status from Active to Inactive, and the line
*mutex = 0;
is never executed, so the other threads stay in the loop forever. I have tried many variants of this cycle, like while(){};, do{}while();, with a temp variable = *mutex inside the loop, even a variant with if(){} and goto. But the result is the same.
The host part of the code:
...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);
I use Visual Studio 2012 with CUDA 5.5.
The device is NVidia GeForce GT 240 with compute capability 1.2.
Thanks in advance.
UPD:
After some time working on my diploma project this spring, I found a solution: a combination of lock-free and mutex mechanisms.
Here is the working code. I used it to implement an atomic dynamically-resizable array.
// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        if (isSet = (atomicCAS(mutex, 0, 1) == 0))
        {
            // critical section goes here
        }
        if (isSet)
        {
            *mutex = 0; // release the lock
        }
    }
    while (!isSet);
}
The loop in question
do
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) until the divergent threads finish. [1]
Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.
A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.
[1] CUDA Best Practices Guide - Branching and Divergence
Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
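A hedged sketch of the suggested pattern (the int counter here is just a stand-in for whatever shared resource the kernel manipulates): one thread pulls the value into __shared__ memory, the whole block updates the local copy, and one thread pushes the result back.

__global__ void kernelWithBlockLocalCopy(int* counter)
{
    __shared__ int localCounter;
    if (threadIdx.x == 0)
        localCounter = *counter; // one thread copies the shared resource in
    __syncthreads();             // everyone sees the initialized copy

    atomicAdd(&localCounter, 1); // all threads update the block-local copy
    __syncthreads();             // wait until every update has landed

    if (threadIdx.x == 0)
        *counter = localCounter; // one thread publishes the result
}

Note that atomic operations on shared memory require compute capability 1.2, which the GT 240 from the question just meets.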

C/C++: unsynchronized threads returning a strange result

Okay, I have a question regarding threads.
There are two unsynchronized threads running simultaneously and using a global resource, "int num":
1st:
void Thread()
{
    int i;
    for (i = 0; i < 100000000; i++)
    {
        num++;
        num--;
    }
}
2nd:
void Thread2()
{
    int j;
    for (j = 0; j < 100000000; j++)
    {
        num++;
        num--;
    }
}
The question states: what are the possible values of the variable "num" at the end of the program?
Now, I would say 0 will be the value of num at the end of the program but, if you try to run this code, you will find that the result is quite random, and I can't understand why.
The full code:
#include <windows.h>
#include <process.h>
#include <stdio.h>

static int num = 0;

void Thread()
{
    int i;
    for (i = 0; i < 100000000; i++)
    {
        num++;
        num--;
    }
}

void Thread2()
{
    int j;
    for (j = 0; j < 100000000; j++)
    {
        num++;
        num--;
    }
}

int main()
{
    long handle, handle2, code, code2;
    handle = _beginthread(Thread, 0, NULL);
    handle2 = _beginthread(Thread2, 0, NULL);
    while ((GetExitCodeThread(handle, &code) || GetExitCodeThread(handle2, &code2)) != 0);
    TerminateThread(handle, code);
    TerminateThread(handle2, code2);
    printf("%d ", num);
    system("pause");
}
num++ and num-- don't have to be atomic operations. To take num++ as an example, this is probably implemented like:
int tmp = num;
tmp = tmp + 1;
num = tmp;
where tmp is held in a CPU register.
Now let's say that num == 0, both threads try to execute num++, and the operations are interleaved as follows:
Thread A                    Thread B
int tmp = num;
tmp = tmp + 1;
                            int tmp = num;
                            tmp = tmp + 1;
num = tmp;
                            num = tmp;
The result at the end will be num == 1 even though it should have been incremented twice. Here, one increment is lost; in the same way, a decrement could be lost as well.
In pathological cases, all increments of one thread could be lost, resulting in num == -100000000, or all decrements of one thread could be lost, resulting in num == +100000000. There may even be more extreme scenarios lurking out there.
Then there's also other business going on, because num isn't declared as volatile. Both threads will therefore assume that the value of num doesn't change, unless they are the one changing it. This allows the compiler to optimize away the entire for loop, if it feels so inclined!
The possible values for num include all possible int values, plus floating point values, strings, and jpegs of nasal demons. Once you invoke undefined behavior, all bets are off.
More specifically, modifying the same object from multiple threads without synchronization results in undefined behavior. On most real-world systems, the worst effects you see will probably be missing or double increments or decrements, but it could be much worse (memory corruption, crashing, file corruption, etc.). So just don't do it.
The next upcoming C and C++ standards will include atomic types which can be safely accessed from multiple threads without any synchronization API.
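For reference, here is a minimal sketch using C++11's std::atomic (the facility that answer refers to); with an atomic counter, no increment or decrement can be lost, so this always prints 0:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> num(0);

void worker()
{
    for (int i = 0; i < 100000000; i++)
    {
        ++num; // an atomic read-modify-write; cannot be lost
        --num;
    }
}

int main()
{
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    std::printf("%d\n", num.load()); // always 0
}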
You speak of threads running simultaneously, which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.
In the case of multiple devices having access to main memory, either in the form of CPUs or bus-mastering or DMA, they must be synchronized. This is handled by the lock prefix (implicit in the instruction xchg). It accesses a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.
So in the case of two cores on the same chip accessing the same position, the result would be undefined, which may seem strange considering some synchronization should occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when you have the two cores on different chips (i.e. without a shared L3 cache). You can't expect them to be synchronized. Well, you can, but consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips, you can't stop there: you have to perform a full-blown synchronization that blocks out all devices with access, and to ensure a successful synchronization, all the other devices need time to recognize that a synchronization has been requested, and that takes a long time, especially if a device has been granted access and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz), and considering that your CPUs run at 400 times that frequency, you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.
You could argue that a medium type (memory bus only) lock should exist but this means an additional pin on every processor and additional logic in every chipset just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.
To sum it up: a generic synchronization that would handle your situation would render your PC useless due to it always having to wait for the last device to check in and ok the synchronization. It is a better solution to let it be optional and only insert wait states when the developer has determined that it is absolutely necessary.
This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were
// prototypes
char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);
// application code
while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);
while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);
spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it failed (at not setting the spinlock <=> succeeded at setting it) spinlock_leave would just assign to it with "mov mem,0". The "++n" counts the total number of retries.
I changed the loop to 2.5 million (because with two threads and two spinlocks per loop I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" count on a dual-core Athlon II M300 @ 2 GHz, and this is what I found:
Running one thread without timing (except for the main loop) and without locks (as in the original example): 33748884 cycles <=> 16.9 ms => 13.5 cycles/loop.
Running one thread with spinlocks (i.e. no other core competing) took 210917969 cycles <=> 105.5 ms => 84.4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per spinlocked sequence. Still, the slowest spinlock required 1334208 cycles: that's 667 us, or only 1500 every second.
So, the addition of spinlocks unaffected by another CPU added several hundred percent to the total execution time. The final value in num was 0.
Running two threads without spinlocks took 171157957 cycles <=> 85.6 ms => 68.5 cycles/loop. num contained 10176.
Two threads with spinlocks took 4099370103 cycles <=> 2049 ms => 1640 cycles/loop <=> 0.82 us/loop. The spinlocks required 3930091465 cycles => 786 cycles per spinlocked sequence. The slowest spinlock required 27038623 cycles: that's 13.52 ms, or only 74 every second. num contained 0.
Incidentally, the 171157957 cycles for two threads without spinlocks compares very favorably to two threads with spinlocks with the spinlock time removed: 4099370103 - 3930091465 = 169278638 cycles.
For my sequence, the spinlock competition caused 21-29 million retries per thread, which comes out to 4.2-5.8 retries per spinlock, or 5.2-6.8 tries per spinlock. The addition of spinlocks caused an execution-time penalty of 1927% (1500/74 - 1). The slowest spinlock required 5-8% of all tries.
As Thomas said, the results are unpredictable because your increment and decrement are non-atomic. You can use InterlockedIncrement and InterlockedDecrement -- which are atomic -- to see a predictable result.
Interlocked Variable Access (MSDN)
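For instance, here is a minimal sketch of that change applied to one of the question's threads (the Interlocked API requires a volatile LONG operand):

#include <windows.h>

static volatile LONG num = 0;

void Thread()
{
    int i;
    for (i = 0; i < 100000000; i++)
    {
        InterlockedIncrement(&num); // atomic ++num
        InterlockedDecrement(&num); // atomic --num
    }
}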