How to get a list (or subset) from an OpenCL Kernel?

I have a large array of 2^20 ulongs. This little OpenCL kernel runs through it like a charm. Yet I have absolutely no idea (and Google hasn't helped here) how to return a small number of items (2^10) from it.
What I'm looking for is a fixed-size list with at most 1024 items whose Hamming distance (popcount) is smaller than a given number. The list order doesn't matter, so perhaps I should be asking for a subset of these 2^20 items.

Since the output is expected to be much smaller than the input, using a global index into the output through atomic access will not be too inefficient. You need to pass a buffer containing a single uint, initially set to 0:
__kernel void K(...,__global uint * outIndex,...)
{
    ...
    if (selected)
    {
        uint index = atomic_inc(outIndex); // or atom_inc if using OpenCL 1.0 extension
        out[index] = value;
    }
}
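The same pattern can be sketched on the host in C++ to make the logic explicit. This is only an illustration under assumptions: `process_item`, `stored_count`, and the capacity check are hypothetical additions (the kernel above does not show a bounds check), and `std::atomic::fetch_add` stands in for `atomic_inc`.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Portable popcount: number of set bits in x.
inline int popcount64(std::uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// One "work item": if the value is selected, claim the next output slot.
// Hypothetical helper; safe to call from several threads concurrently.
inline void process_item(std::uint64_t value, int threshold,
                         std::atomic<std::uint32_t>& out_index,
                         std::uint64_t* out, std::uint32_t capacity) {
    assert(out != nullptr);
    if (popcount64(value) < threshold) {
        std::uint32_t index = out_index.fetch_add(1);  // like atomic_inc
        if (index < capacity) out[index] = value;      // drop overflow
    }
}

// Number of items actually stored (the counter may exceed the capacity).
inline std::uint32_t stored_count(const std::atomic<std::uint32_t>& out_index,
                                  std::uint32_t capacity) {
    return std::min(out_index.load(), capacity);
}
```

After the kernel finishes, the host would read the counter first and then only that many items from the output buffer.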

A list as such is not supported in OpenCL. OpenCL is a kind of standard C with some extensions and some limitations. You can only operate on buffers (aka arrays).
What you are probably looking for is a global memory buffer, which you need to allocate before you run the kernel. The kernel writes its results into this buffer, and you retrieve them with a clEnqueueReadBuffer call.

Well, there is a way, through some hacks. I forked pyopencl.algorithm and created a new method, sparse_copy_if(), that returns a buffer of exactly the size I need, as if it were a list that items had been appended to. I will document it and submit a patch to Andreas.
If your buffers are too large, though, there is a way to improve performance even more: I followed Rick's suggestion above, created a hash table, and threw the desired results in there. (Note that there's always a risk of collision, so the hash table buffer/array has to be orders of magnitude larger than your expected output.)
Then, I run sparse_copy_if() on the hash table buffer and receive nothing but a perfectly-sized buffer.
In conclusion:
I have a kernel scanning a buffer of 1,000,000 items. It computes results for all of them but doesn't separate out the results I want.
These desired results are then thrown into a buffer of ~25,000 items (the hash table, significantly smaller than the original data).
Then, by running sparse_copy_if() on the hash table buffer, you get the desired output, almost as if it were a list that items could be appended to.
sparse_copy_if(), of course, has the overhead of creating the perfectly-sized buffers and copying data to them. But I've found that this overhead generally pays off, as you are now making (low-latency) transfers of small buffers/arrays from device back to host.
Code for testing sparse_copy_if() performance versus copy_if().

Related

Fftw3 library and plans reuse

I'm about to use the fftw3 library for a very specific task.
I have a heavily loaded packet stream with a variable frame size, which is processed like this:
while(thereIsStillData){
    copyDataToInputArray();
    createFFTWPlan();
    performExecution();
    destroyPlan();
}
Since creating plans is rather expensive, I want to modify my code to something like this:
while(thereIsStillData){
    if(inputArraySizeDiffers()) destroyOldAndCreateNewPlan();
    copyDataToInputArray(); // e.g. `memcpy` or `std::copy`;
    performExecution();
}
Can I do this? I mean, does the plan contain some important information derived from the data, such that a plan created for one array of size N will give incorrect results when executed on another array of the same size N?

The fftw_execute() function does not modify the plan presented to it, and can be called multiple times with the same plan. Note, however, that the plan contains pointers to the input and output arrays, so if copyDataToInputArray() involves creating a different input (or output) array then you cannot afterwards use the old plan in fftw_execute() to transform the new data.
FFTW does, however, have a set of "New-array Execute Functions" that could help here, supposing that the new arrays satisfy some additional similarity criteria with respect to the old (see linked docs for details).
The docs do recommend:
If you are tempted to use the new-array execute interface because you want to transform a known bunch of arrays of the same size, you should probably go use the advanced interface instead
but that's talking about transforming multiple arrays that are all in memory simultaneously, and arranged in a regular manner.
Note, too, that if your variable frame size is not too variable -- that is, if it is always one of a relatively small number of choices -- then you could consider keeping a separate plan in memory for each frame size instead of recomputing a plan every time one frame's size differs from the previous one's.
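The "one plan per frame size" idea from the last paragraph could be sketched as a small cache keyed by size. This is a minimal sketch under assumptions: `DemoPlan` and `PlanCache` are hypothetical names, and `DemoPlan` stands in for a wrapper that would create an `fftw_plan` (for example with `fftw_plan_dft_1d`) and call `fftw_destroy_plan` in its destructor. Because a plan is bound to its arrays, each cached plan would also need its own fixed input/output buffers, or execution via `fftw_execute_dft`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a wrapper around fftw_plan.
struct DemoPlan {
    std::size_t n;  // frame size this plan was created for
};

// Creates a plan the first time a frame size is seen and reuses it afterwards.
class PlanCache {
public:
    using Factory = std::function<std::unique_ptr<DemoPlan>(std::size_t)>;

    explicit PlanCache(Factory make) : make_(std::move(make)) {}

    DemoPlan& get(std::size_t n) {
        auto it = plans_.find(n);
        if (it == plans_.end()) {
            std::unique_ptr<DemoPlan> plan = make_(n);  // expensive step
            assert(plan != nullptr);
            it = plans_.emplace(n, std::move(plan)).first;
        }
        return *it->second;
    }

    std::size_t size() const { return plans_.size(); }

private:
    Factory make_;
    std::unordered_map<std::size_t, std::unique_ptr<DemoPlan>> plans_;
};
```

If the set of frame sizes is unbounded, such a cache would need an eviction policy, which is left out here.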

What is the Optimal Memory Setup for OpenCL where the host needs access at regular time steps?

I'm looking for the best way to set up the CL memory objects for my project, which does a device-side physics simulation. The buffers will be accessed by the host every frame, approximately every 16 ms, to get the updated data for rendering. Unfortunately, I cannot send the new data straight to the GPU via a VBO.
The data in the buffer consists of structs with 3 cl_float4's and one cl_float. I also want to have the ability for the host to update some of the structs in the buffer, this will not be per-frame.
Currently I'm looking to have all the data be allocated/stored on the GPU and using map/unmap whenever the host requires access. But this brings up two issues that I can see:
A device-to-host copy is still required for rendering.
The buffer must be rebuilt whenever objects are added to or removed from the simulation, or additional validation data must exist per struct to check whether the object is "alive"/valid.
Any advice is appreciated. If you need any additional info or code snippets, just let me know.
Thank you.

Algorithm for efficient memory management
You're asking for the best setup for OpenCL memory. I'm assuming you mostly care about high performance and can accept some size overhead.
This means you should perform as many operations as possible on the GPU. Syncing between CPU/GPU should be minimized.
Memory model
I will now describe in detail what such a memory and processing model should look like.
Preallocate buffers with the maximum size and fill them over time.
Track how many elements currently are in the buffer
Have separate buffers for validity and your data. The validity buffer indicates the validity for each element
Adding elements
Adding elements can be done via the following principle:
Have a buffer with host pointer for input data. The size of the buffer is determined by the maximum number of input elements
When you receive data, copy it onto the host buffer and sync it to the GPU
(Optional) Preprocess input data on the GPU
In a kernel, add input data and corresponding validity behind the last element in the global buffer. Input points that are empty (maybe you just got 100 input points instead of 10000), just mark them as invalid.
This has several effects:
Adding can be completely done in parallel
You only have to sync a small buffer (input data buffer) to the GPU
When adding input data, you always add the maximum number of input elements to the buffer, but most of them will be empty/invalid. So when you frequently add points, the buffer fills up quickly.
If your rendering step is not able to discard invalid points, you must remove invalid points from the model before rendering.
Otherwise, you can postpone cleaning up until it is really needed because the model is becoming too big and threatens to overflow.
Removing elements
Removing elements should be done via the following principle:
Have a kernel that determines whether an element becomes invalid. If so, just mark its validity accordingly (if you want, you can zero or NaN out the data too, but that is not necessary).
Have an algorithm that is able to remove invalid elements from the buffer and give you the number of valid, consecutive elements in the buffer (that information is needed when adding elements).
Such an algorithm will require you to perform sorts and a search using parallel reduction.
Sorting elements in parallel
Sorting a buffer, especially one with many elements, is highly demanding. You should use available implementations to do so.
Simple Bitonic sort:
If you do not need the maximum possible performance and want simple code, this is your choice.
Implementation available: https://software.intel.com/en-us/articles/bitonic-sorting
Simple to integrate, just a single kernel.
Can only sort 4*2^n elements (as far as I remember).
WARNING: This sort does not work with numbers larger than one billion (1,000,000,000). Not sure why but finding that out cost me quite some time.
Fast radix sort:
If you care about maximum performance and have lots of elements to sort (1 million up to 1 billion or even more), this is your choice.
Implementation available: https://github.com/sschaetz/nvidia-opencl-examples/tree/master/OpenCL/src/oclRadixSort
More difficult to integrate, several kernel calls
Can only sort 2^n elements (as far as I remember)
Faster than Bitonic sort, especially with more than 1 million elements
Finding out the number of valid elements
If the buffer has been sorted and all invalid elements have been removed, you could simply count the number of valid values in parallel, or find the index of the first invalid element (this requires you to have unused buffer space invalidated). Both ways will give you the number of valid elements.
Problem size vs. sorting size restrictions
To overcome the problems that arise with only being able to sort a fixed number of elements, just pad out with values whose sorting behavior you know.
Example:
You want to sort 10,000 integers with values between 0 and 10 million in ascending order.
You can only sort 2^n elements
The closest you will get is 2^14 = 16384.
Have a buffer for sorting with 2^14 elements
Fill the buffer with the 10000 values to sort.
Fill all remaining values of the buffer with a value you know will be sorted behind the 10,000 actually existing values.
Since you know your value range (0 to 10 million), you could pick 11 million as filling value.
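The padding trick in this example can be sketched as follows. This is a host-side illustration only: `next_pow2` and `sort_padded` are hypothetical helper names, and `std::sort` stands in for a GPU sort that only accepts 2^n elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0).
inline std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Pads the input to a power-of-two length with a sentinel that sorts
// behind every real value, sorts, then drops the padding again.
inline std::vector<std::uint32_t> sort_padded(std::vector<std::uint32_t> values,
                                              std::uint32_t sentinel) {
    const std::size_t real_count = values.size();
    // Precondition: the sentinel must be larger than every real value.
    assert(std::all_of(values.begin(), values.end(),
                       [sentinel](std::uint32_t v) { return v < sentinel; }));
    values.resize(next_pow2(real_count), sentinel);  // e.g. 10000 -> 16384
    std::sort(values.begin(), values.end());         // stand-in for the GPU sort
    values.resize(real_count);                       // padding is at the end
    return values;
}
```

With the numbers from the example, `next_pow2(10000)` gives 16384 and 11,000,000 would serve as the sentinel.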
In-place sorting problem
In-place sorting and removing of elements is difficult (but possible) to implement. An easier solution is to determine the indices of consecutive valid elements and write them to a new buffer in that order and then swap buffers.
But this requires you to swap buffers or copy back, which costs both performance and space. Choose the lesser evil in your case.
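The easier, out-of-place variant can be sketched with a prefix sum over the validity buffer. This is a sequential host-side sketch, not a kernel: `compact` is a hypothetical name, and on the GPU both the scan and the scatter step would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies all valid elements of `data` to the front of `out`, preserving
// their order, and returns how many valid elements there are.
inline std::size_t compact(const std::vector<float>& data,
                           const std::vector<std::uint8_t>& valid,
                           std::vector<float>& out) {
    assert(data.size() == valid.size());

    // Exclusive prefix sum of the validity flags: dest[i] is the output
    // index of element i if it is valid.
    std::vector<std::size_t> dest(data.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        dest[i] = count;
        count += valid[i] ? 1 : 0;
    }

    // Scatter: every valid element writes itself to its destination slot.
    out.assign(count, 0.0f);
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (valid[i]) out[dest[i]] = data[i];
    }
    return count;  // also the index where new elements can be appended
}
```

The returned count is the "number of valid, consecutive elements" needed when adding elements later.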
More advice
Only add wait-events if you are still not content with the performance. However, this will complicate your code and possibly introduce bugs (which won't even be your fault: there is a nasty bug with Nvidia cards and OpenCL where wait-events are not destroyed and memory leaks, which will slowly but surely cause problems).
Be very careful with syncing/mapping buffers to the CPU too early, as this sync call will force all kernels using this buffer to finish.
If adding elements rarely occurs, and your rendering step is able to discard invalid elements, you can postpone removing elements from the buffer until it is really needed (too many elements threaten to overflow your buffer).

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a computationally intensive C or C++ function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom stays cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I would need to be very careful about where to introduce the prefetches;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?

Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, update them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
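The "accumulate in a register, write once" idea can be sketched in portable C++ without inline assembly. This is an illustration under assumptions: the matrix-vector product is a made-up example workload, `matvec` is a hypothetical name, and whether `acc` really stays in a register is up to the compiler, although optimizing compilers usually keep it there.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a row-major rows x cols matrix A.
// Each output element is accumulated in a local variable and written to
// memory exactly once, so the output array is not touched in the inner loop.
inline void matvec(const std::vector<double>& a,
                   const std::vector<double>& x,
                   std::vector<double>& y,
                   std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols && x.size() == cols && y.size() == rows);
    for (std::size_t r = 0; r < rows; ++r) {
        double acc = 0.0;                      // partial result, no memory traffic
        for (std::size_t c = 0; c < cols; ++c) {
            acc += a[r * cols + c] * x[c];     // only the inputs are read here
        }
        y[r] = acc;                            // single write per output element
    }
}
```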

I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then, all elements in output[] are strictly write. If so, all you would ever need to 'reserve' is one cache-line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cachelines 'output' can consume, as opposed to reserving a certain number of lines?

I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache, 1/8th per thread.

I guess the solution to this is hidden in the algorithm employed, the L1 cache size, and the cache line size.
I am not sure how much performance improvement we will see with this, though.
We can probably introduce artificial reads that the compiler will not optimize away and that do not hurt the computation at run time. A single artificial read should fill as many cache lines as are needed to accommodate one page. Therefore, the algorithm should be modified to compute the output array in blocks, similar to the way huge matrices are multiplied on GPUs, where blocks of the matrices are used for computing and writing the result.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial reads, we should initialize the output array at compile time in the right places, once in each block, probably with 0 or 1.

Searching in large memory mapped files

I have a large data structure stored in a memory-mapped file. The data structure is very simple:
struct Header {
    ...some metadata...
    uint32_t index_size;
    uint64_t index[];
};
This header is placed at the beginning of the file. It uses the struct hack: a variable-sized structure whose last element has no fixed size and can grow.
char* mmaped_region = ...; // This memory comes from memory mapped file!
Header* pheader = reinterpret_cast<Header*>(mmaped_region);
The memory-mapped region starts with the Header, and Header::index_size contains the correct length of the Header::index array. This array contains the offsets of the data elements, so we can do this:
uint64_t offset = pheader->index[x];
DataItem* item = reinterpret_cast<DataItem*>(mmaped_region + offset);
// At this point, variable item contains pointer to data element
// if variable x contains correct index value (less than pheader->index_size)
All the data elements are sorted (a less-than relation is defined for data elements). They are stored in the same memory-mapped region as the Header, but starting from the end and growing toward the beginning. Data elements can't be moved because they are of variable size; instead, the indexes in the header are moved during the sort procedure. This is very much like a B-tree page in modern databases, where the index array is usually called an indirection vector.
Searches
This data structure is searched with an interpolation search algorithm (with a limited number of steps) and then with binary search. First, I have the whole index array to search, and I try to calculate where the searched element would be stored if the distribution were uniform. I get some calculated index, look at the element at this index, and it usually doesn't match. Then I narrow the search range and repeat. The number of interpolation search steps is limited by some small number. After that, the data structure is searched with binary search. This works very well with small data sets, because the distribution is usually uniform: a few iterations of the interpolation search and we're done.
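The search described here might look roughly like the sketch below. It is a simplified illustration, not the asker's code: it searches a plain sorted vector of distinct integer keys instead of following offsets into the mapped region, and `find_key` and `max_interp_steps` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of `target` in the sorted vector `keys`, or -1.
// Runs at most `max_interp_steps` interpolation steps, then falls back
// to binary search on the remaining range.
inline std::ptrdiff_t find_key(const std::vector<std::uint64_t>& keys,
                               std::uint64_t target,
                               int max_interp_steps = 4) {
    assert(max_interp_steps >= 0);
    std::size_t lo = 0, hi = keys.size();  // current range is [lo, hi)
    for (int step = 0; step < max_interp_steps && hi - lo > 1; ++step) {
        const std::uint64_t first = keys[lo], last = keys[hi - 1];
        if (target < first || target > last) return -1;
        if (first == last) break;
        // Guess the position as if keys were uniformly distributed.
        const long double frac = static_cast<long double>(target - first) /
                                 static_cast<long double>(last - first);
        const std::size_t pos =
            lo + static_cast<std::size_t>(frac * (hi - 1 - lo));
        if (keys[pos] == target) return static_cast<std::ptrdiff_t>(pos);
        if (keys[pos] < target) lo = pos + 1; else hi = pos;
    }
    // Binary search on whatever range is left.
    auto begin = keys.begin() + static_cast<std::ptrdiff_t>(lo);
    auto end = keys.begin() + static_cast<std::ptrdiff_t>(hi);
    auto it = std::lower_bound(begin, end, target);
    if (it != end && *it == target) return it - keys.begin();
    return -1;
}
```

In the memory-mapped setting, every `keys[...]` access corresponds to dereferencing an offset and may trigger a page fault, which is where the cost comes from.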
Problem definition.
The memory-mapped region can be very large in reality. For testing I use 32 GB of file-backed storage and search for some random keys. This is very slow because this pattern causes a lot of random disk reads (not all of the data can be cached in memory).
What can be done here? I think that setting MADV_RANDOM with madvise syscall can help, but probably not very much. I want to get on par with B-tree search speed. Maybe it is possible to use mincore syscall to check what data-elements can be painlessly checked during interpolation search? Maybe I can use prefetching of some sort?

The interpolation search appears to be a good idea here. It usually has a small benefit, but in this case even a small number of iterations saved helps a lot since they're so slow (disk I/O).
However, real databases duplicate the actual key values in their indices. The space overhead for that is fully justified in the performance improvement. Btrees are a further improvement because they pack multiple related nodes in a single contiguous block of memory, further reducing disk seeks.
This is probably the correct solution for you as well. You should duplicate the keys to avoid disk I/O. If you can't alter the existing header, you can probably get away with duplicating the keys in a separate structure and keeping that fully in memory.
A compromise is possible, where you just cache the top (2^N)-1 keys for the first N levels of binary search. That means you have to give up your interpolation for that part of the search, but as noted before interpolation is not a huge win anyway. The disk seeks saved will easily pay off. Even caching just the median key (N=1) will already save you one disk seek per lookup. And you can still use interpolation once you've run out of the cache.
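The compromise could be sketched like this: the keys probed during the first N levels of binary search are copied into an in-memory array in heap order, and only deeper probes go to the slow storage. All names (`TopLevelCache`, `Fetch`) are hypothetical, and the `fetch` callback stands in for reading a key through the memory-mapped region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

class TopLevelCache {
public:
    // fetch(i) returns the key of the i-th element (the slow, disk-backed read).
    using Fetch = std::function<std::uint64_t(std::size_t)>;

    TopLevelCache(std::size_t n, int levels, Fetch fetch)
        : n_(n), fetch_(std::move(fetch)) {
        assert(levels >= 0 && levels < 30);
        cache_.resize((std::size_t{1} << levels) - 1);  // (2^N)-1 keys
        fill(0, n_, 0);
    }

    // Binary search; the first N levels are answered from the cache.
    std::ptrdiff_t find(std::uint64_t target) const {
        std::size_t lo = 0, hi = n_, node = 0;  // node: heap index of this probe
        while (lo < hi) {
            const std::size_t mid = lo + (hi - lo) / 2;
            const std::uint64_t key =
                node < cache_.size() ? cache_[node] : fetch_(mid);
            if (key == target) return static_cast<std::ptrdiff_t>(mid);
            if (key < target) { lo = mid + 1; node = 2 * node + 2; }
            else              { hi = mid;     node = 2 * node + 1; }
        }
        return -1;
    }

private:
    // Pre-reads the midpoints that binary search visits in the top levels.
    void fill(std::size_t lo, std::size_t hi, std::size_t node) {
        if (lo >= hi || node >= cache_.size()) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        cache_[node] = fetch_(mid);
        fill(lo, mid, 2 * node + 1);
        fill(mid + 1, hi, 2 * node + 2);
    }

    std::size_t n_;
    Fetch fetch_;
    std::vector<std::uint64_t> cache_;
};
```

Once the cached levels are exhausted, the remaining range could also be handed to an interpolation search instead of continuing with plain binary search.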
In comparison, any attempt to fiddle with memory mapping parameters will give you a few percent speed improvement at best. "On par with B-trees" is not going to happen. If your algorithm needs those physical seeks, you lose. No magical pixie dust will fix a bad algorithm or a bad datastructure.

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. So far I have managed to implement the whole thing in user space, using some kernel variables to look up the volume per IP address.
Now the question is: suppose I have a LOT of IP addresses (i.e. 500,000 up to 1,000,000 unique ones). It is really not clear how much space is allocated, what initial size a new hash table gets when it is created with g_hash_table_new()/g_hash_table_new_full(), and how the whole thing works in the background. It is known that resizing a hash table can take a lot of time. So how can we play with these parameters?

Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only available as the number of values stored in it; you don't have access to the actual size of the array that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.
