Fftw3 library and plans reuse - c++

I'm about to use fftw3 library in my very certain task.
I have a heavy load packets stream with variable frame size, which is produced like that:
while(thereIsStillData){
copyDataToInputArray();
createFFTWPlan();
performExecution();
destroyPlan();
}
Since creating plans is rather expensive, I want to modify my code to something like this:
while(thereIsStillData){
if(inputArraySizeDiffers()) destroyOldAndCreateNewPlan();
copyDataToInputArray(); // e.g. `memcpy` or `std::copy`;
performExecution();
}
Can I do this? I mean, does plan contain some important information based on data such, that plan created for one array with size N, when executed will give incorrect results for the other array of same size N.

The fftw_execute() function does not modify the plan presented to it, and can be called multiple times with the same plan. Note, however, that the plan contains pointers to the input and output arrays, so if copyDataToInputArray() involves creating a different input (or output) array then you cannot afterwards use the old plan in fftw_execute() to transform the new data.
FFTW does, however, have a set of "New-array Execute Functions" that could help here, supposing that the new arrays satisfy some additional similarity criteria with respect to the old (see linked docs for details).
The docs do recommend:
If you are tempted to use the new-array execute interface because you want to transform a known bunch of arrays of the same size, you should probably go use the advanced interface instead
but that's talking about transforming multiple arrays that are all in memory simultaneously, and arranged in a regular manner.
Note, too, that if your variable frame size is not too variable -- that is, if it is always one of a relatively small number of choices -- then you could consider keeping a separate plan in memory for each frame size instead of recomputing a plan every time one frame's size differs from the previous one's.

Related

Variable Length Array Performance Implications (C/C++)

I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
uint8_t buffer_to_send[length + 1];
buffer_to_send[0] = MY_SPECIAL_BYTE;
memcpy(buffer_to_send + 1, buffer, length);
// more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.
You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html
I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed size array. This is no worse than the variable length array you have shown because you must always have enough space left on the stack otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using the further space on your stack so it will be safe to allocate a fixed size array.

MPI synchronize matrix of vectors

Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to another vector but only one process may move each object. In other words each process has a list of objects it may move, but the matrix may be altered by everyone. And two processes can't move the same object ever.
In that case you'll need to send messages using MPI_Bcast that inform the other processors about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can only send the messages to the root process which performs the permutations and then after the barrier sends it to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is readlly bad. First: the matrix size is fixed. Second: adjacent rows are not necessarily stored in contiguous memory, slowing your program down like crazy (it literally can be orders of magnitudes). Instead represent the matrix as a linear array of size Nrows*Ncols, and then index the elements as Nrows*i + j where Nrows is the number of rows and i and j are the row and column indices, respectively. If you don't want column-major storage instead, address the elements by i + Ncols*j. You can wrap this index-juggling in inline functions that have virtually zero overhead.
I would suggest to lay out the data differently:
Each process has a map of his objects and their position in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate and communicate that structure with MPI_Allgather (each process sends it data to all other processes, at the end everyone has all data). If you need fast lookup by coordinates, then you can build up a cache.
That may or may not be performing well. Other optimizations (like sharing 'transactions') totally depend on your objects and the operations you perform on them.

How to get a list (or subset) from an OpenCL Kernel?

I have a large array with 2^20 ulongs on it. This little OpenCL kernel flows through it like a charm. Yet, I have absolutely no idea (and google hasn't helped here) how to return a small number of items (2^10) from it.
What I'm looking for is a fixed-sized list with at most 1024 items that have hamming distance (popcount) smaller than a given number. The list order doesn't matter, so perhaps I should be asking for a subset of these 2**20 items.
Since the output is expected to be much smaller than the input, using a global index in the output through atomic access will not be too ineffective. You need to pass a buffer containing a single uint, initially set to 0:
__kernel void K(...,__global uint * outIndex,...)
{
...
if (selected)
{
uint index = atomic_inc(outIndex); // or atom_inc if using OpenCL 1.0 extension
out[index] = value;
}
}
A list as such is not supported with OpenCL. OpenCL is a kind of standard C with some extensions and some limitation. You can only operate on buffers (aka arrays).
What you might look for is a global memory buffer which you need to allocate before you run the kernel. In this you can put your results in and with an clEnqueueReadBuffer you can retrieve your results.
Well, there is a way, through some hacks. I forked pyopencl.algorithm and created a new method, sparse_copy_if(), that returns the exact-sized buffer I need, as if it were a list with items being appended to. I will document it, and submit a patch to Andreas.
If your buffers are too large, though, there is a way to improve performance even more: I followed Rick's suggestion above, created a hash table, and threw the desired results in there. (Note that there's always risk of collision, so the hash table buffer/array has to be orders of magnitude larger than your expected output).
Then, I run sparse_copy_if() on the hash table buffer and receive nothing but a perfectly-sized buffer.
In conclusion:
I have a kernel scanning a 1,000,000-sized buffer. It computes results for all of them but doesn't separate the results I want.
These desired results are then thrown in a ~25,000 buffer (hash table, significantly smaller then the original data).
Then, by running sparse_copy_if() on the hash table buffer, you get the desired output---almost as if it were a list in which items could have been appended to.
sparse_copy_if(), of course, has the overhead of creating the perfectly-sized buffers, and copying data to them. But I've found that this overhead generally compensates, as you are making now (low-latency) transfers of small buffers/arrays from device back to host.
Code for testing sparse_copy_if() performance versus copy_if().

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable and use IP-addresses as keys, and the olume of data sent/received by this IP-address as the value. For instance I succeeded to implement the whole issue in user-space using some kernel variables in order to look to the volume per IP-address.
Now the question is: Suppose I have a LOT of IP-addresses (i.e. 500,000 up to 1,000,000 uniques) => it is really not clear what is the space allocated and the first size that was given to a new hash table created when using (g_hash_table_new()/g_hash_table_new_full()), and how the whole thing works in the background. It is known that when resizing a hash table it can take a lot of time. So how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only available as the number of values stored in it, you don't have access to the actual array size that typically is used in the implementation.
However, the existance of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.

mmap-loadable data structure library for C++ (or C)

I have a some large data structure (N > 10,000) that usually only needs to be created once (at runtime), and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the line
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar class for hash table myself but I find it pretty hard to maintain, so I would like to see if there's existing solutions already. The library should
Contain a creation routine that serialize the data structure into file. This part doesn't need to be fast.
Contain a loading routine that mmap a file into read-only (or read-write) data structure that can be usable within O(1) steps of processing.
Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
Small time overhead to accessors. (i.e. the complexity isn't modified.)
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory mapped file and then create the STL map structure with a customer allocator. Your customer allocator then simply takes the beginning of the memory of the memory mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory mapped file and should be reloadable later.
You will have to check if memory is free'd by the STL map. If it is, your customer allocator will lose some memory of the memory mapped file but if this is limited you can probably live with it.
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time opimising the hash function for the particular data, so there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
There's certainly unordered set/multiset support (C++ code for hash tables).
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known before hand. If you know all keys before hand, then it is the way to go since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of Dconf is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html