Does `mem_fence` provide consistency between work-groups?


I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.
Starting at each leaf node, the parent node is visited and its flag atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated and we continue with that node's parents.
Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?
kernel void internalBounds(global struct Bound * const bounds,
global unsigned int * const flags,
const global struct Node * const nodes) {
const unsigned int n = get_global_size(0);
const size_t D = 3;
const size_t leaf_start = n - 1;
size_t node_idx = leaf_start + get_global_id(0);
do {
node_idx = nodes[node_idx].parent;
write_mem_fence(CLK_GLOBAL_MEM_FENCE);
// Mark node as visited, both children initialized on second visit
if (atomic_inc(&flags[node_idx]) < 1)
break;
read_mem_fence(CLK_GLOBAL_MEM_FENCE);
const global unsigned int * child_idxs = nodes[node_idx].internal.children;
for (size_t d = 0; d < D; d++) {
bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
bounds[child_idxs[1]].min[d]);
bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
bounds[child_idxs[1]].max[d]);
}
} while (node_idx != 0);
}
I am limited to OpenCL 1.2.

No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work-group when accessing global memory. There is no inter-work-group synchronization in OpenCL 1.x.
Try using a single, large work-group and iterating over the data, and/or start with small trees that fit inside a single work-group.
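For comparison, here is a minimal host-side sketch of the visit-counter pattern from the question, written against the C++11 memory model rather than OpenCL 1.2. It only illustrates the ordering the algorithm relies on: the counter increment has to publish the child's bound (release) and make the sibling's bound visible to the second visitor (acquire). The tree layout and the names refit_from_leaf and refit_demo are made up for this illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Bound { float min[3]; float max[3]; };

// Walk from one leaf towards the root. The second visitor of a node merges
// the two child boxes; the first visitor stops.
void refit_from_leaf(std::size_t leaf,
                     const std::vector<int>& parent,
                     const std::vector<int>& left,
                     const std::vector<int>& right,
                     std::vector<Bound>& bounds,
                     std::vector<std::atomic<unsigned>>& flags) {
    int node = parent[leaf];
    while (node >= 0) {
        // acq_rel: the release half publishes this thread's earlier bound
        // writes, the acquire half makes the sibling's writes visible.
        if (flags[node].fetch_add(1, std::memory_order_acq_rel) == 0)
            return;  // first visit: the sibling subtree is not finished yet
        const Bound& a = bounds[left[node]];
        const Bound& b = bounds[right[node]];
        for (int d = 0; d < 3; ++d) {
            bounds[node].min[d] = std::min(a.min[d], b.min[d]);
            bounds[node].max[d] = std::max(a.max[d], b.max[d]);
        }
        node = parent[node];
    }
}

// Hypothetical 4-leaf tree: nodes 0..2 are internal (0 is the root),
// nodes 3..6 are leaves. One thread per leaf, as in the kernel.
Bound refit_demo() {
    const std::size_t n = 4;
    std::vector<int> parent = {-1, 0, 0, 1, 1, 2, 2};
    std::vector<int> left   = {1, 3, 5};
    std::vector<int> right  = {2, 4, 6};
    std::vector<Bound> bounds(2 * n - 1);
    for (std::size_t i = 0; i < n; ++i) {
        const float v = static_cast<float>(i);
        bounds[n - 1 + i] = Bound{{v, v, v}, {v + 1, v + 1, v + 1}};
    }
    std::vector<std::atomic<unsigned>> flags(n - 1);
    for (auto& f : flags) f.store(0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < n; ++i)
        threads.emplace_back([&, i] {
            refit_from_leaf(n - 1 + i, parent, left, right, bounds, flags);
        });
    for (auto& t : threads) t.join();
    return bounds[0];  // root box
}
```

Calling refit_demo() should return the box covering all four leaves. Per the answer above, OpenCL 1.x does not give the equivalent guarantee between work-groups, so this sketch does not carry over to the kernel as written.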

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html
mem_fence(...) orders memory accesses for a single work-item only. Even if all work-items execute this line, they may not reach it (and continue past it) at the same time.
barrier(...) does synchronize all work-items in a work-group and makes them wait for the slowest one (that is accessing the memory type given as the parameter), but only among the work-items of its own work-group (for example 64 or 256 on AMD and Intel, and perhaps 1024 on NVIDIA). One reason is that an OpenCL implementation may be designed to finish all wavefronts before loading the next batch of wavefronts, because all global work-items would simply not fit in on-chip memory. For example, 64M work-items each using 1 kB of local memory would need 64 GB of memory; even software emulation would need hundreds or thousands of passes and would reduce performance to the level of a single-core CPU.
Global synchronization (across all work-groups) is not possible.
In case the terms work-item, work-group and processing element get mixed up, see:
OpenCL: Work items, Processing elements, NDRange
The atomic function you use there already accesses global memory, so adding group-scope synchronization shouldn't matter.
Also check the generated machine code to see whether
bounds[child_idxs[0]].min[d]
loads the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If it does, you can split min out into an independent array and access its items directly to get up to 100% more memory bandwidth for it.
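The "independent array" idea above is a structure-of-arrays layout. A minimal sketch in host C++, just to show the indexing; BoundsSoA and merge_children are names made up for this illustration, and whether it actually saves bandwidth depends on what the compiler generates for the original struct access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: one contiguous array per field and per axis,
// so reading min[d] of a node never drags the rest of the struct along.
struct BoundsSoA {
    std::vector<float> min[3];
    std::vector<float> max[3];
    explicit BoundsSoA(std::size_t count) {
        for (int d = 0; d < 3; ++d) {
            min[d].assign(count, 0.0f);
            max[d].assign(count, 0.0f);
        }
    }
};

// Merge the boxes of children c0 and c1 into node.
void merge_children(BoundsSoA& b, std::size_t node,
                    std::size_t c0, std::size_t c1) {
    assert(node < b.min[0].size() && c0 < b.min[0].size() && c1 < b.min[0].size());
    for (int d = 0; d < 3; ++d) {
        b.min[d][node] = std::min(b.min[d][c0], b.min[d][c1]);
        b.max[d][node] = std::max(b.max[d][c0], b.max[d][c1]);
    }
}
```

In the kernel this would correspond to passing separate global arrays for the min and max components instead of a single array of struct Bound.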
Test on an Intel HD 400 with more than 100,000 threads:
__kernel void fenceTest( __global float *c,
__global int *ctr)
{
int id=get_global_id(0);
if(id<128000)
for(int i=0;i<20000;i++)
{
c[id]+=ctr[0];
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
ctr[0]++;
}
2900ms (c array has garbage)
__kernel void fenceTest( __global float *c,
__global int *ctr)
{
int id=get_global_id(0);
if(id<128000)
for(int i=0;i<20000;i++)
{
c[id]+=ctr[0];
}
ctr[0]++;
}
500 ms (c array has garbage). 500 ms is ~6x the performance of the fence version (my laptop has single-channel 4 GB RAM, which gives only 5-10 GB/s, but its iGPU local memory has nearly 38 GB/s: 64 B per cycle at 600 MHz). The local-fence version takes 700 ms, so it seems the fenceless version doesn't even touch cache or local memory for some iterations.
Without the loop it takes 8-9 ms, so I suppose the compiler wasn't optimizing the loop away in these kernels.
Edit:
int id=get_global_id(0);
if(id==0)
{
atom_inc(&ctr[0]);
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
behaves exactly as
int id=get_global_id(0);
if(id==0)
{
ctr[0]++;
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
for this Intel iGPU device. This is only by chance: it shows that the changed memory is visible to "all" trailing threads in this run, but it doesn't prove that this always happens (for example, the first compute unit could hiccup and the second could start first), and the plain increment is not atomic when more than a single thread accesses it.

Related

Multithread performance drops down after a few operations

I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. A single iteration first loads a SIFT file containing some features and then queries these features against a tree. Since I have a lot of images, I use multiple threads to do the querying. Here are the code snippets.
struct MultiMatchParam
{
int thread_id;
float *scores;
double *scores_d;
int *perm;
size_t db_image_num;
std::vector<std::string> *query_filenames;
int start_id;
int num_query;
int dim;
VocabTree *tree;
FILE *file;
};
// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
// Clear scores
for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
{
for (size_t i = 0; i < param.db_image_num; i++)
param.scores[i] = 0.0;
DTYPE *keys;
int num_keys;
keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);
int normalize = true;
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
delete [] keys;
}
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20), the CPU usage suddenly drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug is this line of code:
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
If I replace it with another costly operation (e.g., a large for-loop containing sqrt), the CPU usage always stays at nearly 100%. The MultiScoreQueryKeys function does a complex operation on a tree. Since all eight cores may read the same tree (there are no write operations to the tree), I wonder whether the read operation has some kind of blocking effect. But it shouldn't, because I don't have write operations in this function. Also, the operations in the loop are basically the same in every iteration; if they were blocking, it would already happen in the first few iterations. If you need to see the details of this function or other parts of the project, please let me know.

Use std::async() instead of zeta::SimpleLock lock
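The answer gives no further detail, so the following is only a guess at what a lock-free std::async version could look like. Each task owns a contiguous range of query images and writes only to its own output slots, so no lock is needed. query_one is a hypothetical stand-in for the real per-image work (ReadKeys_sfm plus MultiScoreQueryKeys).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for the per-image query.
double query_one(std::size_t image) {
    return static_cast<double>(image) * 0.5;
}

// Split [0, num_images) into contiguous chunks and run each chunk
// as its own task via std::async.
std::vector<double> query_all(std::size_t num_images, std::size_t num_tasks) {
    assert(num_tasks > 0);
    std::vector<double> scores(num_images);
    std::vector<std::future<void>> tasks;
    const std::size_t chunk = (num_images + num_tasks - 1) / num_tasks;
    for (std::size_t begin = 0; begin < num_images; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, num_images);
        tasks.push_back(std::async(std::launch::async, [&scores, begin, end] {
            // Each task touches only scores[begin, end): no shared writes.
            for (std::size_t i = begin; i < end; ++i)
                scores[i] = query_one(i);
        }));
    }
    for (auto& t : tasks) t.get();  // wait and propagate exceptions
    return scores;
}
```

Note that in the posted MultiMatch every query writes into the same param.scores buffer, so each task would need its own score buffer for this to apply.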

Trying to understand prefix sum execution

I am trying to understand the scan-then-fan scan implementation described in the book The CUDA Handbook.
Can someone explain the device function scanWarp? Why the negative indices? Could you please give a numerical example?
I have the same question about the line warpPartials[16+warpid] = sum. How does the assignment happen?
What is the contribution of this line: if ( warpid==0 ) {scanWarp<T,bZeroPadded>( 16+warpPartials+tid ); }
Could someone please explain sum += warpPartials[16+warpid-1]; ? A numerical example would be highly appreciated.
Finally, a more C++-oriented question: how do we know which indices are used in *sPartials = sum; to store values in sPartials?
PS: A numerical example that demonstrates the whole execution would be very helpful.
template < class T, bool bZeroPadded >
inline __device__ T
scanBlock( volatile T *sPartials ){
extern __shared__ T warpPartials[];
const int tid = threadIdx.x;
const int lane = tid & 31;
const int warpid = tid >> 5;
//
// Compute this thread's partial sum
//
T sum = scanWarp<T,bZeroPadded>( sPartials );
__syncthreads();
//
// Write each warp's reduction to shared memory
//
if ( lane == 31 ) {
warpPartials[16+warpid] = sum;
}
__syncthreads();
//
// Have one warp scan reductions
//
if ( warpid==0 ) {
scanWarp<T,bZeroPadded>( 16+warpPartials+tid );
}
__syncthreads();
//
// Fan out the exclusive scan element (obtained
// by the conditional and the decrement by 1)
// to this warp's pending output
//
if ( warpid > 0 ) {
sum += warpPartials[16+warpid-1];
}
__syncthreads();
//
// Write this thread's scan output
//
*sPartials = sum;
__syncthreads();
//
// The return value will only be used by caller if it
// contains the spine value (i.e. the reduction
// of the array we just scanned).
//
return sum;
}
template < class T >
inline __device__ T
scanWarp( volatile T *sPartials ){
const int tid = threadIdx.x;
const int lane = tid & 31;
if ( lane >= 1 ) sPartials[0] += sPartials[- 1];
if ( lane >= 2 ) sPartials[0] += sPartials[- 2];
if ( lane >= 4 ) sPartials[0] += sPartials[- 4];
if ( lane >= 8 ) sPartials[0] += sPartials[- 8];
if ( lane >= 16 ) sPartials[0] += sPartials[-16];
return sPartials[0];
}

The scan-then-fan strategy is applied at two levels. For the grid-level scan (which operates on global memory), partials are written to the temporary global memory buffer allocated in the host code, scanned by recursively calling the host function, then added to the eventual output with a separate kernel invocation. For the block-level scan (which operates on shared memory), partials are written to the base of shared memory (warpPartials[]), scanned by one warp, then added to the eventual output of the block-level scan. The code that you are asking about is doing the block-level scan.
The implementation of scanWarp that you are referencing is called with a shared memory pointer that has already had threadIdx.x added to it, so each thread's version of sPartials points to a different shared memory element. Using a fixed index on sPartials causes adjacent threads to operate on adjacent shared memory elements. Negative indices are okay as long as they do not result in out-of-bounds array indexing. This implementation is borrowed from the optimized version that pads shared memory with zeros (Listing 13.14), so that every thread can unconditionally use a fixed negative index and threads below a certain index just read zeros. It could just as easily have predicated execution on the lowest threads in the warp and used positive indices.
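Since a numerical example was asked for, here is a small CPU simulation of what scanWarp does for one warp. It is a sketch, not the book's code: the name scan_warp_sim is made up, and the warp width is simply the vector length so the trace stays short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of scanWarp for one warp: an inclusive scan done in
// lockstep passes. Each pass mirrors one line of the form
//   if ( lane >= k ) sPartials[0] += sPartials[-k];
// where sPartials[-k] is the element owned by the lane k positions lower.
//
// Trace for 8 lanes that all start with 1:
//   start     : 1 1 1 1 1 1 1 1
//   offset 1  : 1 2 2 2 2 2 2 2
//   offset 2  : 1 2 3 4 4 4 4 4
//   offset 4  : 1 2 3 4 5 6 7 8
std::vector<int> scan_warp_sim(std::vector<int> v) {
    for (std::size_t offset = 1; offset < v.size(); offset *= 2) {
        const std::vector<int> prev = v;  // all lanes read before any lane writes
        for (std::size_t lane = offset; lane < v.size(); ++lane)
            v[lane] = prev[lane] + prev[lane - offset];
    }
    return v;
}
```

With 32 lanes the same loop runs the five offsets 1, 2, 4, 8 and 16 that appear in the device function.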
The last thread (lane 31) of each 32-thread warp contains that warp's partial sum, which has to be stored somewhere in order to be scanned and then added to the output. warpPartials[] aliases shared memory from the first element, so it can be used to hold each warp's partial sum. You could use any part of shared memory to do this calculation, because each thread already has its own scan value in registers (the assignment T sum = scanWarp...).
Some warp (it could be any warp, so it might as well be warp 0) has to scan the partials that were written to warpPartials[]. At most one warp is needed because there is a hardware limitation of 1024 threads per block = 1024/32 or 32 warps. So this code is taking advantage of the coincidence that the maximum number of threads per block, divided by the warp size, is no larger than the maximum number of threads per warp.
This code is adding the scanned per-warp partials to each output element. The first warp already has the correct values, so the addition is done only by the second and subsequent warps. Another way to look at this is that it's adding the exclusive scan of the warp partials to the output.
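The block-level steps can be sketched on the CPU as well. This simulation uses plain serial loops in place of the warp scans and a separate partials array in place of the aliased warpPartials[]; scan_block_sim is a made-up name, and the warp width is a parameter so a 16-element block with 4-wide warps can be traced by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of the block-level scan-then-fan.
// Example: 16 ones, warp width 4
//   after step 1 : 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4
//   partials     : 4 4 4 4   -> scanned: 4 8 12 16
//   after step 4 : 1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16
std::vector<int> scan_block_sim(std::vector<int> data, std::size_t warp) {
    assert(warp > 0 && data.size() % warp == 0);
    const std::size_t warps = data.size() / warp;
    // 1. Every warp does an inclusive scan of its own slice.
    for (std::size_t w = 0; w < warps; ++w)
        for (std::size_t i = 1; i < warp; ++i)
            data[w * warp + i] += data[w * warp + i - 1];
    // 2. The last lane of each warp writes the warp's total.
    std::vector<int> partials(warps);
    for (std::size_t w = 0; w < warps; ++w)
        partials[w] = data[w * warp + warp - 1];
    // 3. One warp scans the per-warp totals (inclusive).
    for (std::size_t w = 1; w < warps; ++w)
        partials[w] += partials[w - 1];
    // 4. Fan out: warps after the first add the scanned total of the
    //    previous warp, i.e. the exclusive scan of the totals.
    for (std::size_t w = 1; w < warps; ++w)
        for (std::size_t i = 0; i < warp; ++i)
            data[w * warp + i] += partials[w - 1];
    return data;
}
```

Step 4 is the sum += warpPartials[16+warpid-1] line: the -1 picks the previous warp's scanned total, which is why warp 0 is skipped.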
scanBlock is a device function - the address arithmetic gets done by its caller, scanAndWritePartials: volatile T *myShared = sPartials+tid;

(Answer rewritten now I have more time)
Here's an example (based on an implementation I wrote in C++ AMP, not CUDA). To make the diagram smaller each warp is 4 elements wide and a block is 16 elements.
The following paper is also pretty useful: Efficient Parallel Scan Algorithms for GPUs. So is Parallel Scan for Stream Architectures.

CUDA shared memory programming is not working

all:
I am learning how shared memory accelerates GPU programs. I am using the code below to calculate the squared value of each element plus the squared value of the average of its left and right neighbors.
The code runs; however, the result is not as expected.
The first 10 results printed are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.
The code is as follows, with reference to here:
Would you please tell me which part goes wrong? Your help is very much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
const int N=1024;
__global__ void compute_it(float *data)
{
int tid = threadIdx.x;
__shared__ float myblock[N];
float tmp;
// load the thread's data element into shared memory
myblock[tid] = data[tid];
// ensure that all threads have loaded their values into
// shared memory; otherwise, one thread might be computing
// on uninitialized data.
__syncthreads();
// compute the average of this thread's left and right neighbors
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
// square the previous result and add my value, squared
tmp = tmp*tmp + myblock[tid]*myblock[tid];
// write the result back to global memory
data[tid] = myblock[tid];
__syncthreads();
}
int main (){
char key;
float *a;
float *dev_a;
a = (float*)malloc(N*sizeof(float));
cudaMalloc((void**)&dev_a,N*sizeof(float));
for (int i=0; i<N; i++){
a [i] = i;
}
cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
compute_it<<<N,1>>>(dev_a);
cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);
for (int i=0; i<10; i++){
std::cout<<a [i]<<",";
}
std::cin>>key;
free (a);
free (dev_a);
}

One of the most immediate problems in your kernel code is this:
data[tid] = myblock[tid];
I think you probably meant this:
data[tid] = tmp;
In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock.)
There are many problems with this approach, but one immediate problem will be encountered here:
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<31?tid+1:0]) * 0.5f;
Since tid is always zero, and therefore no other values in your shared memory array (myblock) get populated, the logic in this line cannot be sensible. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.
It seems that you don't understand the various CUDA hierarchies:
a grid is all threads associated with a kernel launch
a grid is composed of threadblocks
each threadblock is a group of threads working together on a single SM
the shared memory resource is a per-SM resource, not a device-wide resource
__syncthreads() also operates on a threadblock basis (not device-wide)
threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid.
Instead you should break your problem into groups of reasonable-sized threadblocks (i.e. more than one thread). Each threadblock will then be able to behave in a fashion that is roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock.
You're also not doing proper cuda error checking which is recommended, especially any time you're having trouble with a CUDA code.
If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:
compute_it<<<1,N>>>(dev_a);
As indicated by Kristof, you will get something that comes close to what you want, I think. However you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
This line of code is also not correct:
free (dev_a);
Since dev_a was allocated on the device using cudaMalloc you should free it like this:
cudaFree (dev_a);
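When debugging a kernel like this it helps to have a CPU reference to compare against. A minimal sketch of the intended computation, with the same wrap-around neighbor rule as the kernel; compute_reference is a name made up here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the kernel: for each element, square the average of its
// left and right neighbors (wrapping at the ends) and add the element squared.
std::vector<float> compute_reference(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float left  = in[i > 0 ? i - 1 : n - 1];  // wraps to the last element
        const float right = in[i + 1 < n ? i + 1 : 0];  // wraps to the first element
        const float avg = (left + right) * 0.5f;
        out[i] = avg * avg + in[i] * in[i];
    }
    return out;
}
```

With N=1024 and a[i] = i, this gives 2, 8, 18, ... for elements 1, 2, 3, ... as expected in the question. Element 0 wraps around to element 1023, so it comes out as 512*512 = 262144; the 25 listed in the question would correspond to N=10.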

Since you have only one thread per block, your tid will always be 0.
Try launching the kernel this way:
compute_it<<<1,N>>>(dev_a);
instead of
compute_it<<<N,1>>>(dev_a);

MapViewOfFile and VirtualLock

Will the following code load data from file into system memory so that access to the resulting pointer will never block threads?
auto ptr = VirtualLock(MapViewOfFile(file_map, FILE_MAP_READ, high, low, size), size); // Map file to memory and wait for DMA transfer to finish.
int val0 = reinterpret_cast<int*>(ptr)[0]; // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size-4]; // Will not block thread?
VirtualUnlock(ptr);
UnmapViewOfFile(ptr);
EDIT:
Updated after Dammons answer.
auto ptr = MapViewOfFile(file_map, FILE_MAP_READ, high, low, size);
#pragma optimize("", off)
char dummy;
for(int n = 0; n < size; n += 4096)
dummy = reinterpret_cast<char*>(ptr)[n];
#pragma optimize("", on)
int val0 = reinterpret_cast<int*>(ptr)[0]; // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size-4]; // Will not block thread?
UnmapViewOfFile(ptr);

If the file's size is less than the ridiculously small maximum working set size (or, if you have modified your working set size accordingly) then in theory yes. If you exceed your maximum working set size, VirtualLock will simply do nothing (that is, fail).
(In practice, I've seen VirtualLock being rather... liberal... at interpreting what it's supposed to do as opposed to what it actually does, at least under Windows XP -- might be different under more modern versions)
I've been trying similar things in the past, and I'm now simply touching all pages that I want in RAM with a simple for loop (reading one byte). This leaves no questions open and works, with the sole possible exception that a page might in theory get swapped out again after touched. In practice, this never happens (unless the machine is really really low on RAM, and then it's ok to happen).
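A minimal sketch of that page-touching loop in portable C++. It reads through a volatile pointer so the compiler cannot drop the loads, which avoids the #pragma optimize trick. touch_pages is a made-up helper, and the 4096-byte default is an assumption taken from the loop in the question; on Windows the actual page size can be queried with GetSystemInfo.

```cpp
#include <cassert>
#include <cstddef>

// Read one byte from every page of [base, base + size) so that each page is
// faulted in. Returns the number of pages touched.
std::size_t touch_pages(const void* base, std::size_t size,
                        std::size_t page_size = 4096) {
    assert(page_size > 0);
    const volatile unsigned char* p =
        static_cast<const volatile unsigned char*>(base);
    std::size_t touched = 0;
    for (std::size_t off = 0; off < size; off += page_size) {
        unsigned char sink = p[off];  // volatile read: not optimized away
        (void)sink;
        ++touched;
    }
    return touched;
}
```

For the mapped view this would be called as touch_pages(ptr, size) right after MapViewOfFile. As noted above, nothing stops the OS from paging the data out again later.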

Random memory accesses are expensive?

While optimizing my Connect Four game engine, I reached a point where further improvements can only be minimal, because much of the CPU time is spent in the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
if (te.height == NOTSET || lock == te.lock)
return te;
}
return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried replacing the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like the following:
struct TableEntry
{ // Old New
unsigned __int64 lock; // 8 8
enum { VALID, UBOUND, LBOUND }flag; // 4 4
short score; // 4 2
char move; // 4 1
char height; // 4 1
// -------
// 24 16 Bytes
TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better, but I think Konrad Rudolph is right and the main reason why it's that slow is the cache misses.

You are making copies of your table entry. What about using TableEntry& as the type? For the default value at the bottom, a static default TableEntry() will also do. I suppose that is where you lose much of the time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
// hopefully now less than 35% of CPU usage :-)
const TableEntry& te = mTable[idx + i];
if (te.height == NOTSET || lock == te.lock)
return te;
}
return DEFAULT_TABLE_ENTRY;
}

How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker if they are contiguous because of cache hits, but it seems you are already doing this.

The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
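To put numbers on the bucket size: with the 16-byte TableEntry and BUCKETSIZE = 4 from the question, one bucket is exactly 64 bytes. The sketch below assumes a 64-byte cache line (typical for x86, but an assumption here) and fixed-width types so the layout is checkable. lines_touched is a made-up helper that counts how many cache lines one bucket spans for a given misalignment of the table's base address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum Flag : std::int32_t { VALID, UBOUND, LBOUND };

// Same fields as the "New" column in the question, with fixed-width types.
struct TableEntry {
    std::uint64_t lock;
    Flag flag;
    std::int16_t score;
    std::int8_t move;
    std::int8_t height;
};

const std::size_t BUCKETSIZE = 4;
const std::size_t CACHE_LINE = 64;  // assumption: typical x86 line size

static_assert(sizeof(TableEntry) == 16, "entry is expected to be 16 bytes");

// Byte offset of the bucket for a key, mirroring (lock & 0xFFFFF) * BUCKETSIZE.
std::size_t bucket_offset(std::uint64_t lock) {
    return static_cast<std::size_t>(lock & 0xFFFFF) * BUCKETSIZE * sizeof(TableEntry);
}

// Number of cache lines one bucket spans when the table's base address is
// base_misalignment bytes past a cache-line boundary.
std::size_t lines_touched(std::size_t base_misalignment, std::uint64_t lock) {
    const std::size_t begin = base_misalignment + bucket_offset(lock);
    const std::size_t end = begin + BUCKETSIZE * sizeof(TableEntry) - 1;
    return end / CACHE_LINE - begin / CACHE_LINE + 1;
}
```

Under these assumptions a bucket costs one cache line if the table starts on a 64-byte boundary and two otherwise, so each lookup with a new lock value is roughly one or two cache misses regardless of how many of the four entries are scanned.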

I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size, you will thrash. This manifests as the exact problem you are encountering: the instruction which first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.

Use pointers
TableEntry* getTableEntry(unsigned __int64 lock) {
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
TableEntry* max = &mTable[idx + BUCKETSIZE];
for (TableEntry* te = &mTable[idx]; te < max; te++)
{
if (te->height == NOTSET || lock == te->lock)
return te;
}
return DEFAULT_TABLE_ENTRY;
}
