I'd like to revisit the problem of implementing a simple spinlock in CUDA, now that Independent Thread Scheduling (ITS) has been around for a while.
My code looks like this:
// nvcc main.cu -arch=sm_75
#include <cstdio>
#include <iostream>
#include <vector>
#include "cuda.h"
constexpr int kN = 21;
using Ptr = uint8_t*;
struct DynamicNode {
  int32_t lock = 0;
  int32_t n = 0;
  Ptr ptr = nullptr;
};

__global__ void func0(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    // entering the critical section
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();
    // Use a condition to artificially boost the complexity
    // of loop unrolling for the compiler
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    // leaving the critical section
    atomicExch(lock, 0);
    __threadfence();
  }
}

int main() {
  DynamicNode* dev_root = nullptr;
  constexpr int kRootSize = sizeof(DynamicNode) * kN;
  cudaMalloc((void**)&dev_root, kRootSize);
  cudaMemset(dev_root, 0, kRootSize);
  func0<<<1, kN>>>(dev_root);
  cudaDeviceSynchronize();
  std::vector<int32_t> host_root(kRootSize / sizeof(int32_t), 0);
  cudaMemcpy(host_root.data(), dev_root, kRootSize, cudaMemcpyDeviceToHost);
  cudaFree((void*)dev_root);
  const auto* base = reinterpret_cast<const DynamicNode*>(host_root.data());
  int sum = 0;
  for (int i = 0; i < kN; ++i) {
    auto& dn = base[i];
    std::cout << "i=" << i << " len=" << dn.n << std::endl;
    sum += dn.n;
  }
  std::cout << "sum=" << sum << " expected=" << kN * kN << std::endl;
  return 0;
}
As you can see, there's a naive spinlock implemented in func0. While I understand that this would result in a deadlock on older architectures (e.g. https://forums.developer.nvidia.com/t/atomic-locks/25522/2), if I compile the code with nvcc main.cu -arch=sm_75, it actually runs without blocking indefinitely.
However, what I do notice is that the n values of the DynamicNodes come out wrong. Here's the output on a GeForce RTX 2060 (laptop), which I can reproduce deterministically:
i=0 len=21
i=1 len=230
i=2 len=19
i=3 len=18
i=4 len=17
i=5 len=16
i=6 len=15
i=7 len=14
i=8 len=13
i=9 len=12
i=10 len=11
i=11 len=10
i=12 len=9
i=13 len=8
i=14 len=7
i=15 len=6
i=16 len=5
i=17 len=4
i=18 len=3
i=19 len=2
i=20 len=1
sum=441 expected=441
Ideally, n in every DynamicNode should end up equal to kN. I've also tried larger values of kN (*), and only sum ever comes out correct.
Have I misunderstood something about ITS? Does ITS actually make such a lock implementation viable? If not, what am I missing here?
(*) With a smaller kN, nvcc might actually unroll the loop, from what I saw in the PTX. I've never observed any problem when the loop is unrolled.
Update 02/02/2021
I should have clarified that I tested this on CUDA 11.1. According to @robert-crovella, upgrading to 11.2 should fix the problem.
Update 02/03/2021
I tested with the CUDA 11.2 driver; it still didn't fully solve the problem with a larger kN:
kN \ CUDA    11.1    11.2
21           N       OK
128          N       N
This appears to have been some sort of code generation defect in the compiler. The solution seems to be to update to CUDA 11.2 (or, presumably, any newer release).
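For reference, the acquire/release semantics this spinlock is trying to implement can be written down with plain C++11 atomics on the host side. The sketch below is mine (the class name SpinLock is not from the code above); it isn't CUDA code and doesn't address the codegen issue, it just spells out the ordering a correct lock needs, with the release applied on the unlocking store itself:

#include <atomic>

// Host-side sketch of the lock semantics, using C++11 atomics.
class SpinLock {
 public:
  void lock() {
    // Spin until we flip the flag from 0 to 1. Acquire ordering makes the
    // previous owner's writes visible to us once we hold the lock.
    while (flag_.exchange(1, std::memory_order_acquire) == 1) {
    }
  }
  void unlock() {
    // Release ordering publishes our writes before the lock appears free.
    flag_.store(0, std::memory_order_release);
  }

 private:
  std::atomic<int> flag_{0};
};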
Related
I set up the following race condition to generate some random bits. However, as far as I can tell, the output is NOT random. I want to understand why (for learning purposes). Here is my code:
#include <iostream>
#include <vector>
#include <atomic>
#include <thread>
#include <cmath>
using namespace std;
void compute_entropy(const vector<bool> &randoms) {
    int n0 = 0, n1 = 0;
    for(bool x: randoms) {
        if(!x) n0++;
        else n1++;
    }
    double f0 = n0 / ((double)n0 + n1), f1 = n1 / ((double)n0 + n1);
    double entropy = - f0 * log2(f0) - f1 * log2(f1);
    for(int i = 0; i < min((int)randoms.size(), 100); ++i)
        cout << randoms[i];
    cout << endl;
    cout << endl;
    cout << f0 << " " << f1 << " " << endl;
    cout << entropy << endl;
    return;
}

int main() {
    const int N = 1e7;
    bool x = false;
    atomic<bool> finish1(false), finish2(false);
    vector<bool> randoms;
    thread t1([&]() {
        for(int i = 0; !finish1; ++i)
            x = false;
    });
    thread t2([&]() {
        for(int i = 0; !finish2; ++i)
            x = true;
    });
    thread t3([&]() {
        for(int i = 0; i < N; ++i)
            randoms.push_back(x);
        finish1 = finish2 = true;
    });
    t3.join();
    t1.join();
    t2.join();
    compute_entropy(randoms);
    return 0;
}
I compile and run it like this:
$ g++ -std=c++14 threads.cpp -o threads -lpthread
$ ./threads
0101001011000111110100101101111101100100010001111000111110001001010100011101110011011000010100001110
0.473792 0.526208
0.998017
No matter how many times I run it, the results are skewed.
With 10 million numbers, the results from a proper random number generator are as one would expect:
>>> np.mean(np.random.randint(0, 2, int(1e7)))
0.5003456
>>> np.mean(np.random.randint(0, 2, int(1e7)))
0.4997095
Why is the output from race conditions not random?
There is no guarantee that a race condition will produce random output. It is not guaranteed to be truly random, nor even pseudo-random of any quality.
as far as I can tell, the output is NOT random.
No test can definitively disprove randomness.
There are tests that can show that a sequence probably doesn't contain certain specific patterns, and thus a sequence that passes many such tests is probably random. However, as far as I can tell, you haven't performed such a test. You seem to be measuring whether the distribution of the output is even, which is a separate property from randomness. As such, your conclusion that the output isn't random is not based on a relevant measurement.
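To make that concrete, here is one minimal example of such a test (my own sketch, not part of the original question or answer): a lag-1 serial-correlation check. It looks at whether consecutive bits are correlated, which is structure a plain 0/1 frequency count cannot see; values near 0 are consistent with independent bits, values near +1 indicate long runs of repeated bits.

#include <iostream>
#include <vector>

// Lag-1 serial correlation of a bit sequence: roughly 0 for independent bits,
// close to +1 if consecutive bits tend to repeat (long runs).
double serial_correlation(const std::vector<bool>& bits) {
    if (bits.size() < 2) return 0.0;
    double mean = 0.0;
    for (bool b : bits) mean += b;
    mean /= bits.size();
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i + 1 < bits.size(); ++i)
        num += (bits[i] - mean) * (bits[i + 1] - mean);
    for (bool b : bits) den += (b - mean) * (b - mean);
    return den == 0.0 ? 0.0 : num / den;
}

int main() {
    // Feed in the `randoms` vector from the question instead of this toy input.
    std::vector<bool> bits = {false, true, true, true, true, false, false, false, true, true};
    std::cout << "lag-1 correlation: " << serial_correlation(bits) << "\n";
}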
Furthermore, your program has a data race. As such, the behaviour of the entire program is undefined, and there is no guarantee that the program behaves as one might otherwise have reasonably expected.
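For what it's worth, the data race itself is easy to remove without changing the spirit of the experiment: make the shared flag a std::atomic<bool> and use relaxed operations, so the program becomes well-defined while the sampled values still depend on thread timing. This is a sketch of that change (not the original code, and it says nothing about how random the result will be):

#include <atomic>
#include <thread>
#include <vector>

int main() {
    const int N = 1000000;
    std::atomic<bool> x{false};                  // shared flag, now race-free
    std::atomic<bool> finish1{false}, finish2{false};
    std::vector<bool> randoms;
    randoms.reserve(N);

    std::thread t1([&] { while (!finish1) x.store(false, std::memory_order_relaxed); });
    std::thread t2([&] { while (!finish2) x.store(true,  std::memory_order_relaxed); });
    std::thread t3([&] {
        for (int i = 0; i < N; ++i)
            randoms.push_back(x.load(std::memory_order_relaxed));
        finish1 = finish2 = true;
    });

    t3.join();
    t1.join();
    t2.join();
    // `randoms` can now be fed into compute_entropy() as in the question.
}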
TL;DR: How do I safely perform a single-bit update A[n/8] |= (1<<n%8); (i.e., set the n-th bit of A to true) when A is a huge array of chars and the computation runs in parallel using C++11's <thread> library?
I'm performing a computation that's easy to parallelize. I'm computing elements of a certain subset of the natural numbers, and I want to find the elements that are not in the subset. For this I create a huge array (like A = new char[20l*1024l*1024l*1024l], i.e., 20 GiB). The n-th bit of this array is true if n lies in my set.
When doing it in parallel and setting the bits to true using A[n/8] |= (1<<n%8);, I seem to lose a small amount of information, presumably due to concurrent work on the same byte of A (each thread has to first read the byte, update the single bit, and write the byte back). How can I get around this? Is there a way to do this update as an atomic operation?
The code follows. GCC version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609. The machine is an 8-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz with 37 GB RAM. Compiler options: g++ -std=c++11 -pthread -O3
#include <iostream>
#include <thread>
typedef long long myint; // long long to be sure
const myint max_A = 20ll*1024ll*1024ll; // 20 MiB for testing
//const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB in the real code
const myint n_threads = 1; // Number of threads
const myint prime = 1543; // Tested prime
char *A;
const myint max_n = 8*max_A;
inline char getA(myint n) { return A[n/8] & (1<<(n%8)); }
inline void setAtrue(myint n) { A[n/8] |= (1<<n%8); }
void run_thread(myint startpoint) {
    // Calculate all values of x^2 + 2y^2 + prime*z^2 up to max_n
    // We loop through x == startpoint (mod n_threads)
    for(myint x = startpoint; 1*x*x < max_n; x+=n_threads)
        for(myint y = 0; 1*x*x + 2*y*y < max_n; y++)
            for(myint z = 0; 1*x*x + 2*y*y + prime*z*z < max_n; z++)
                setAtrue(1*x*x + 2*y*y + prime*z*z);
}

int main() {
    myint n;
    // Only n_threads-1 threads, as we will use the master thread as well
    std::thread T[n_threads-1];
    // Initialize the array
    A = new char[max_A]();
    // Start the threads
    for(n = 0; n < n_threads-1; n++) T[n] = std::thread(run_thread, n);
    // We use also the master thread
    run_thread(n_threads-1);
    // Synchronize
    for(n = 0; n < n_threads-1; n++) T[n].join();
    // Print and count all elements not in the set and n != 0 (mod prime)
    myint cnt = 0;
    for(n=0; n<max_n; n++) if(( !getA(n) )&&( n%1543 != 0 )) {
        std::cout << n << std::endl;
        cnt++;
    }
    std::cout << "cnt = " << cnt << std::endl;
    return 0;
}
When n_threads = 1, I get the correct value cnt = 29289. With n_threads = 7, I got cnt = 29314 and cnt = 29321 on two different runs, suggesting that some of the bitwise operations on a single byte did indeed interfere with each other.
std::atomic provides all the facilities that you need here:
std::array<std::atomic<char>, max_A> A;
static_assert(sizeof(A[0]) == 1, "Shall not have memory overhead");
static_assert(std::atomic<char>::is_always_lock_free,
              "No software-level locking needed on common platforms");
inline char getA(myint n) { return A[n / 8] & (1 << (n % 8)); }
inline void setAtrue(myint n) { A[n / 8].fetch_or(1 << n % 8); }
The load in getA is atomic (equivalent to load()), and std::atomic even has built-in support for OR-ing the stored value with another one (fetch_or), atomically of course.
When initializing A, the naive way of for (auto& a : A) a = 0; would use sequentially consistent stores and thus pay for synchronization after every store, which you can avoid by relaxing the memory ordering. std::memory_order_release only requires that what we write becomes visible to other threads (but not that other threads' writes become visible to us). And indeed, if you do
// Initialize the array
for (auto& a : A)
    a.store(0, std::memory_order_release);
you get the safety you need without any assembly-level synchronization on x86. You could do the reverse for the loads after the threads finish, but that has no added benefit on x86 (it's just a mov either way).
Demo on the full code: https://godbolt.org/z/nLPlv1
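One practical caveat (my addition, not part of the answer above): a 20 GiB std::array is a single object with static storage duration, which some toolchains will not build without special code-model flags. For the full-size problem you may prefer to put the atomics on the heap. A sketch, assuming the same getA/setAtrue interface as the question:

#include <atomic>
#include <memory>

typedef long long myint;
const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB

// Heap-allocated array of atomic bytes; the trailing () value-initializes them
// to zero (guaranteed since C++20, and what compilers do in practice before that).
std::unique_ptr<std::atomic<unsigned char>[]> A(
    new std::atomic<unsigned char>[max_A]());

inline char getA(myint n) {
    return A[n / 8].load(std::memory_order_relaxed) & (1 << (n % 8));
}
inline void setAtrue(myint n) {
    A[n / 8].fetch_or(static_cast<unsigned char>(1 << (n % 8)));
}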
I've been using OpenMP with Visual Studio 2010 for quite some time now, but today I encountered yet another baffling quirk of VS. After ruling out all the possible suspects, I was left with the program below.
It simply counts in a loop, occasionally does a small calculation, and prints out the counters.
#include "stdafx.h"
#include "omp.h"
#include <string>
#include <iostream>
#include <time.h>
int _tmain(int argc, _TCHAR* argv[])
{
    int count = 0;
    double a = 1;
    double b = 2;
    double c = 3, mean_tau = 1, r_w = 1, weights = 1, r0 = 1, tau = 1, sq_tau = 1,
           r_sw = 1;
    #pragma omp parallel num_threads(3) shared(count)
    {
        int tid = omp_get_thread_num();
        int pers_count = 0;
        std::string someline;
        for (int i = 0; i < 100000; i++)
        {
            pers_count++;
            #pragma omp critical
            {
                count++;
                if ((count%10000 == 0))
                {
                    sq_tau = (r_sw / weights) * pow( 1/ r0 * tau, 2);
                    std::cout << count << " " << pers_count << std::endl;
                }
            }
        }
    }
    std::getchar();
    return 0;
}
Now, if I compile it with optimisation disabled (/Od), it works just as it should, printing the shared counter alongside each thread's private counter (which is roughly three times smaller), something along the lines of
10000 3890
20000 6523
...
300000 100000
If I turn on optimisation (I tried all the options, but for clarity's sake let's say /O2), however, the shared count seems to become private for some reason, as I start getting something like
10000 10000
10000 10000
10000 10000
...
60000 60000
50000 50000
...
100000 100000
And now that I've encountered this quirk, somehow everything that was working before gets rebuilt into an incorrect version, even if I don't change a thing. What could be the cause of this, and what can I do? Thanks.
I don't know why the shared count is behaving this way. I can provide a workaround (assuming you only use atomic operations on the shared variable):
#pragma omp critical
{
    #pragma omp atomic
    count++;
    if ((count%10000 == 0))
    {
        sq_tau = (r_sw / weights) * pow( 1/ r0 * tau, 2);
        std::cout << count << " " << pers_count << std::endl;
    }
}
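If the suspicion is that the optimizer is caching the shared count in a register, another variant worth trying (my own sketch, and only a guess for this particular VS 2010 behaviour, not a verified fix) is to take a local snapshot of count inside the critical section and use only the snapshot afterwards, so every access to the shared variable sits inside the critical section's implied flush:

#pragma omp critical
{
    count++;                // protected update of the shared counter
    int snapshot = count;   // local copy taken while still inside the critical section
    if (snapshot % 10000 == 0)
    {
        sq_tau = (r_sw / weights) * pow(1 / r0 * tau, 2);
        std::cout << snapshot << " " << pers_count << std::endl;
    }
}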
I have written the following code in R and C++, which performs the same algorithm:
a) Simulate the random variable X 500 times. (X has value 0.9 with probability 0.5 and 1.1 with probability 0.5)
b) Multiply these 500 simulated values together to get one value. Save that value in a container
c) Repeat 10000000 times so that the container holds 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();
    // Creating generator
    std::binomial_distribution<int> distribution(steps,0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i!= RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9,power) * pow(1.1,(double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}
int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;
    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
            }
        }
    }
    //starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }
    for (auto& th : threads) th.join();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
    typedef std::chrono::duration<int,std::milli> millisecs_t ;
    millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I use four threads in the C++ code. Can anyone enlighten me? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9,power) * pow(1.1,(double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking every time, I create a temporary vector and then use std::move to shift the elements of that tmpvec into cache. Now the elapsed time has dropped to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15 s to ~4.5 s on my laptop (Windows 7, i5-3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I only have 2 cores, but with hyperthreading) further reduced the running time to ~2.4 s.
Changing the variable power to int (as jimifiki also suggested) offered a slight boost as well, reducing the time to ~2.3 s.
I really enjoyed your question and I tried the code at home. I tried changing the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I'll insist on the pow optimization. You can avoid half of the pow calls by precomputing 0.9^steps before the for loop and using the identity 0.9^(steps - power) * 1.1^power = 0.9^steps * (1.1/0.9)^power. This makes your code run a bit faster:
double power1 = pow(0.9,steps);
double ratio = 1.1/0.9;
for(int i = 0; i!= RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?
For some reason my code is able to perform swaps on doubles faster than swaps on integers. I have no idea why this is happening.
On my machine, the double-swap loop completes 11 times faster than the integer-swap loop. What property of doubles/integers makes them perform this way?
Test setup
Visual Studio 2012 x64
CPU: Core i7 950
Build as Release and run the exe directly; VS debug hooks skew things
Output:
Process time for ints 1.438 secs
Process time for doubles 0.125 secs
#include <iostream>
#include <ctime>
using namespace std;
#define N 2000000000
void swap_i(int *x, int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

void swap_d(double *x, double *y) {
    double tmp = *x;
    *x = *y;
    *y = tmp;
}

int main () {
    int a = 1, b = 2;
    double d = 1.0, e = 2.0, iTime, dTime;
    clock_t c0, c1;

    // Time int swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_i(&a, &b);
    }
    c1 = clock();
    iTime = (double)(c1-c0)/CLOCKS_PER_SEC;

    // Time double swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_d(&d, &e);
    }
    c1 = clock();
    dTime = (double)(c1-c0)/CLOCKS_PER_SEC;

    cout << "Process time for ints " << iTime << " secs" << endl;
    cout << "Process time for doubles " << dTime << " secs" << endl;
}
It seems that VS only optimized one of the loops, as Blastfurnace explained.
When I disable all compiler optimizations and have my swap code inlined inside the loops, I get the following results (I also switched my timer to std::chrono::high_resolution_clock):
Process time for ints 1449 ms
Process time for doubles 1248 ms
You can find the answer by looking at the generated assembly.
Using Visual C++ 2012 (32-bit Release build), the body of swap_i is three mov instructions, but the body of swap_d is completely optimized away, leaving an empty loop. The compiler is smart enough to see that an even number of swaps has no visible effect. I don't know why it doesn't do the same with the int loop.
Just changing #define N 2000000000 to #define N 2000000001 and rebuilding causes the swap_d body to perform actual work. The final times are close on my machine, with swap_d about 3% slower.
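A way to keep both loops honest (my own sketch, not from the answer above) is to make the operands volatile, so the compiler must perform every read and write and can no longer prove that an even number of swaps has no visible effect:

#include <iostream>
#include <ctime>

#define N 2000000000

// volatile forces every load/store to happen, for ints and doubles alike.
void swap_i(volatile int *x, volatile int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

void swap_d(volatile double *x, volatile double *y) {
    double tmp = *x;
    *x = *y;
    *y = tmp;
}

int main() {
    volatile int a = 1, b = 2;
    volatile double d = 1.0, e = 2.0;

    clock_t c0 = clock();
    for (int i = 0; i < N; i++) swap_i(&a, &b);
    clock_t c1 = clock();
    std::cout << "ints:    " << (double)(c1 - c0) / CLOCKS_PER_SEC << " s\n";

    c0 = clock();
    for (int i = 0; i < N; i++) swap_d(&d, &e);
    c1 = clock();
    std::cout << "doubles: " << (double)(c1 - c0) / CLOCKS_PER_SEC << " s\n";
}

Note that this changes what is being measured: volatile also stops the compiler from keeping the values in registers, so the numbers reflect raw load/store traffic rather than fully optimized code, but at least the two loops are then treated the same way.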