Using Simulink Coder - atomic change of multidimensional parameters (matrix, vector) - c++

I am using Simulink and Simulink Coder to generate a DLL from arbitrary models. My C application uses the MathWorks C API.
It runs arbitrary models (hard real time, below 1 ms) and can modify any parameter of the model (via tunable parameters).
For simple scalar values I obtain the address of the value.
Pseudocode:
double* simplegain = (double*) rtwCAPI_GetSignalAddrIdx();
*simplegain = 42;
Everything runs fine. However, this approach cannot be applied if I want an atomic change of a complete vector or matrix.
For multidimensional data I used memcpy to write all values from a source buffer to the address returned by rtwCAPI_GetSignalAddrIdx(). Measurements have shown that using memcpy is too slow.
Analysing the generated code shows various calls to rt_Lookup:
real_T rt_Lookup(const real_T *x, int_T xlen, real_T u, const real_T *y)
// x is the pointer to the matrix. The address of the matrix is declared statically in a global structure `rtDataAddrMap`. I can read it out, but I do not know how to change it.
What I would like to achieve is:
Define a second map in my application (same size).
Write all new values to this second map.
Change just the pointer in rtDataAddrMap to activate the second map.
The general question:
How can I change multidimensional parameters atomically?
What is the regular way to do this? (code generation options, etc.)
The specific question: (if my approach was right)
What are reasonable solutions to change the data pointer of a matrix?

Atomic in the sense of a single instruction that does its work in one clock cycle (and thus cannot be interrupted) is not achievable for multidimensional arrays like these. Instead you will need some real-time mechanism such as a mutex or semaphore to protect your data. Mutexes and semaphores are built upon atomic operations which guarantee that two processes cannot consume the same resource at once.
Your approach with ping-pong buffering of your data area will probably improve performance. Unfortunately I do not have enough experience with MathWorks-generated code to tell you how to implement it there.
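For illustration, here is a minimal sketch of the ping-pong idea in plain C++. The buffer names and the assumption that the real-time task reads the active pointer exactly once per cycle are mine; none of this is part of the Simulink C API:

#include <atomic>
#include <cstring>

constexpr int ROWS = 4, COLS = 4;

static double bufferA[ROWS * COLS];
static double bufferB[ROWS * COLS];
static std::atomic<double*> activeMatrix{bufferA};

// Writer (non-real-time side): fill the inactive buffer, then publish it.
void publishNewMatrix(const double* newValues) {
    double* inactive =
        (activeMatrix.load(std::memory_order_acquire) == bufferA) ? bufferB : bufferA;
    std::memcpy(inactive, newValues, sizeof(bufferA));
    // One atomic pointer store: the reader sees either the old matrix or
    // the new one in full, never a half-written mixture.
    activeMatrix.store(inactive, std::memory_order_release);
}

// Reader (real-time task): take one snapshot of the pointer per cycle.
void realTimeStep() {
    const double* m = activeMatrix.load(std::memory_order_acquire);
    // ... use m[COLS * row + col] for the whole cycle ...
    (void)m;
}

Swapping the data pointer inside rtDataAddrMap would play the role of the atomic store here, but whether the generated code re-reads that pointer on every step is something you would have to verify in the generated sources.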

Related

How can I use Intel PIN to catch all loads to an array?

I'm profiling an application I have written using PIN. The source code of the application uses an array - I want PIN to catch every load instruction made to the array.
Currently, I have annotated the source code of the application I am trying to profile. Every time I read from the array, I first call a function startRegionOfInterest(). Once I finish reading from the array I call another function endRegionOfInterest(). I can use PIN to easily catch calls to these two functions - whenever a load occurs between these two calls I assume it's a load to the array I'm interested in.
However, this is pretty coarse grained, and so I end up classifying a lot of loads that are NOT to the array of interest as reads to the array.
Is there an easier way for me to more precisely catch all loads made to the array I'm interested in?
In your startRegionOfInterest method, you can use some kind of indicator sequence to pass the array address to your PIN tool. E.g., store a magic constant, then store the array address, something like:
volatile void *sink;

void startRegionOfInterest(void *array) {
    sink = (void *)0x48829d2f384be;  // magic marker the PIN tool watches for
    sink = array;                    // the very next store publishes the array address
}
In your PIN tool, you look for a store of the magic constant (within the startRegionOfInterest call for extra specificity, if you want), and then you know the next store is the address of the array. You can communicate the length similarly.
Implementing the sequence with inline asm instead would remove the variability associated with compiler and optimizer behavior, although I think the volatile approach should work in practice (you might have to skip some intervening non-store instructions). A godbolt.
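To sketch the tool side: the following is a rough, untested outline of a pintool that instruments every store, watches for the magic value, and treats the next stored value as the array base. It assumes a 64-bit, single-threaded target, and INS_IsValidForIpointAfter may be spelled differently in older PIN versions:

#include "pin.H"
#include <iostream>

static const ADDRINT MAGIC = 0x48829d2f384beULL;

static ADDRINT lastWriteEA = 0;        // address targeted by the in-flight store
static bool nextStoreIsArray = false;  // set once the magic value is seen
static ADDRINT arrayBase = 0;

VOID BeforeStore(ADDRINT ea) { lastWriteEA = ea; }

VOID AfterStore() {
    ADDRINT value = 0;
    PIN_SafeCopy(&value, reinterpret_cast<VOID*>(lastWriteEA), sizeof(value));
    if (nextStoreIsArray) {
        arrayBase = value;             // the store following the magic one
        nextStoreIsArray = false;
        std::cerr << "array base: " << std::hex << arrayBase << std::endl;
    } else if (value == MAGIC) {
        nextStoreIsArray = true;
    }
}

VOID Instruction(INS ins, VOID*) {
    // Instrument every store; IPOINT_AFTER lets us read the value just written.
    if (INS_IsMemoryWrite(ins) && INS_IsValidForIpointAfter(ins)) {
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)BeforeStore,
                                 IARG_MEMORYWRITE_EA, IARG_END);
        INS_InsertCall(ins, IPOINT_AFTER, (AFUNPTR)AfterStore, IARG_END);
    }
}

int main(int argc, char* argv[]) {
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();  // never returns
    return 0;
}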

How to collect data for each thread OpenMP

I'm new to OpenMP and am trying to sort out how to collect data from threads. I am studying an example that applies OpenMP to the Monte Carlo method (estimating the area of a circle inscribed in a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that pointsInside is originally a single variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of that "array"?
But the main question is how to collect information directly into an array or vector. I tried declaring an array or vector and passing its pointer or address into OpenMP via shared, collecting information for each thread at the corresponding index. But it works slower than the variant with a scalar variable and reduction. I need this vector/array approach for my current project. Thanks a lot!
UPD:
When I said above that "it works slower" I meant a comparison of two implementations of the Monte Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is slower. My guess and question about it are below.
I would like to rephrase my question more clearly. I create a vector/array and pass it into OpenMP via shared. I want to collect data for each thread at the corresponding index in the vector/array. With this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enables synchronization by default when I use shared? If so, how can I disable it? Or do other approaches exist? If not, how do I share a vector/array into the parallel region correctly and without synchronized access?
I'd like to apply this technique in my project, where I want to run through different permutations in the parallel part, collect each permutation and its scalar result outside of the parallel part, then sort the results and choose the best one.
A partial answer:
Am I right that pointsInside is originally a single variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of that "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
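As a rough illustration (my sketch, not literally what any compiler emits), the reduction behaves much like this hand-written version:

#include <omp.h>
#include <cstdio>

int main() {
    const unsigned N = 1u << 20;
    unsigned pointsInside = 0;

    #pragma omp parallel
    {
        unsigned myPointsInside = 0;           // per-thread private copy
        #pragma omp for
        for (unsigned i = 0; i < N; ++i)
            if (i % 4 != 3) ++myPointsInside;  // stand-in for the hit test
        #pragma omp atomic
        pointsInside += myPointsInside;        // combined as the region ends
    }
    std::printf("%u\n", pointsInside);
}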
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.
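For completeness, here is a minimal sketch of the index-per-thread approach from the question, with each slot padded to a cache line so neighbouring threads do not invalidate each other's lines (the 64-byte line size is an assumption about the target CPU, and C++17 is assumed for over-aligned vector elements):

#include <omp.h>
#include <vector>
#include <cstdio>

struct alignas(64) PaddedCount { unsigned long value = 0; };

int main() {
    const int nThreads = omp_get_max_threads();
    std::vector<PaddedCount> counts(nThreads);

    #pragma omp parallel num_threads(nThreads)
    {
        const int tid = omp_get_thread_num();
        for (long i = 0; i < 1000000; ++i)
            counts[tid].value += 1;  // each thread writes only its own slot,
                                     // so no synchronization is needed
    }

    unsigned long total = 0;
    for (const auto& c : counts) total += c.value;
    std::printf("total = %lu\n", total);
}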

Fftw3 library and plans reuse

I'm about to use the fftw3 library for a specific task.
I have a heavy-load packet stream with variable frame size, which is processed like this:
while (thereIsStillData) {
    copyDataToInputArray();
    createFFTWPlan();
    performExecution();
    destroyPlan();
}
Since creating plans is rather expensive, I want to modify my code to something like this:
while (thereIsStillData) {
    if (inputArraySizeDiffers()) destroyOldAndCreateNewPlan();
    copyDataToInputArray(); // e.g. `memcpy` or `std::copy`
    performExecution();
}
Can I do this? I mean, does the plan contain some important information derived from the data itself, such that a plan created for one array of size N will give incorrect results when executed on a different array of the same size N?
The fftw_execute() function does not modify the plan presented to it, and can be called multiple times with the same plan. Note, however, that the plan contains pointers to the input and output arrays, so if copyDataToInputArray() involves creating a different input (or output) array then you cannot afterwards use the old plan in fftw_execute() to transform the new data.
FFTW does, however, have a set of "New-array Execute Functions" that could help here, supposing that the new arrays satisfy some additional similarity criteria with respect to the old (see linked docs for details).
The docs do recommend:
If you are tempted to use the new-array execute interface because you want to transform a known bunch of arrays of the same size, you should probably go use the advanced interface instead
but that's talking about transforming multiple arrays that are all in memory simultaneously, and arranged in a regular manner.
Note, too, that if your variable frame size is not too variable -- that is, if it is always one of a relatively small number of choices -- then you could consider keeping a separate plan in memory for each frame size instead of recomputing a plan every time one frame's size differs from the previous one's.
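A sketch of that per-size plan cache (names are mine and the code is untested; it assumes 1-D complex transforms and copies each frame into the cached input buffer before executing):

#include <fftw3.h>
#include <cstring>
#include <map>

// One cached plan, plus its buffers, per distinct frame size.
struct SizedPlan {
    fftw_complex *in, *out;
    fftw_plan plan;
};

static std::map<int, SizedPlan> planCache;

static SizedPlan& planForSize(int n) {
    auto it = planCache.find(n);
    if (it == planCache.end()) {
        SizedPlan p;
        p.in   = fftw_alloc_complex(n);
        p.out  = fftw_alloc_complex(n);
        // Planning is expensive once per distinct size, then free on reuse.
        p.plan = fftw_plan_dft_1d(n, p.in, p.out, FFTW_FORWARD, FFTW_MEASURE);
        it = planCache.emplace(n, p).first;
    }
    return it->second;
}

void processFrame(const fftw_complex* frame, int n) {
    SizedPlan& p = planForSize(n);
    std::memcpy(p.in, frame, n * sizeof(fftw_complex));
    fftw_execute(p.plan);
    // ... consume p.out ...
}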

Training data structure and access

I'm writing up an implementation of backpropagation for a feedforward neural network in C++ and I'm using the Armadillo library. Right now, I'm loading training data with the method load for the class matrix in the Armadillo library. Two questions:
1) Is this a reasonable choice for storing pre-formatted (CSV), numeric data that fits into main memory (<2 GB)? Certainly some ways to do this are better than others, and it'd be nice to know if this is not good practice. Part of me feels like this isn't a good choice for holding the data, as there are likely more data-oriented structures/frameworks (like I should be accessing some SQL database or something). Another part of me feels like numeric data is by definition just matrices, so this should be wonderful.
2) I need to sample without replacement from a data set in my implementation, and I see two routes: either I could shuffle the rows of the data set, or shuffle an array that indexes the data set. There is a shuffle method for the matrix class in the Armadillo library, and I suspect that what is shuffled is addresses and not the rows themselves. Wouldn't that be just as efficient as shuffling an indexing array?
1) Yes, this is fine and it's how I would do it, but note that Armadillo matrices are column-major and thus you may need to transpose the CSV that you load. If your data is sufficiently large that it won't fit in main memory, you could consider writing a custom CSV parser that looks at the data in a streaming sense (i.e. one point at a time), thus reducing your RAM footprint, or you could even use mmap() to map a file full of packed doubles as your matrix and let the kernel work out what needs to be swapped in when.
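If you go the mmap() route, a rough sketch might look like this (POSIX-only and untested; "data.bin", nRows, and nCols are placeholders, and the file is assumed to hold packed column-major doubles):

#include <armadillo>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

using namespace arma;

int main() {
    const uword nRows = 10, nCols = 1000;  // assumed known in advance
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    const size_t bytes = nRows * nCols * sizeof(double);
    // MAP_PRIVATE: accidental writes are copy-on-write and never hit the file.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    // copy_aux_mem=false, strict=true: the matrix aliases the mapped file,
    // and the kernel pages data in on demand.
    mat data(static_cast<double*>(p), nRows, nCols, false, true);
    // ... train on data ...

    munmap(p, bytes);
    close(fd);
    return 0;
}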
2) Because all matrix data is stored contiguously (i.e. double* not double**), shuffle() will be moving the elements in the matrix. What I generally do in this type of situation is create a vector of indices and shuffle it:
uvec indices = linspace<uvec>(0, n - 1, n);  // 0, 1, ..., n - 1
indices = shuffle(indices);                  // shuffle() returns a shuffled copy
// Now loop over each shuffled point...
for (uword i = 0; i < n; ++i)
{
    // access the point with data.col(indices[i]) and do whatever
}
(The above code isn't tested, but it should work or easily be adapted into something that works.)
For what it's worth, mlpack (http://www.mlpack.org/) does have a not-yet-stable neural network infrastructure that uses Armadillo, and it may be worth your time to check out; the link below goes directly to the relevant source, but poking around on GitHub and the mlpack website should reveal better documentation.
https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/ann

MPI synchronize matrix of vectors

Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to another vector, but only one process may move any given object. In other words, each process has a list of objects it may move, while the matrix may be altered by everyone; two processes can never move the same object.
In that case you'll need to send messages using MPI_Bcast that inform the other processes about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can send the messages only to the root process, which performs the permutations and then, after the barrier, sends the result to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is really bad. First: the matrix size is fixed. Second: adjacent rows are not necessarily stored in contiguous memory, slowing your program down like crazy (it can literally be orders of magnitude). Instead represent the matrix as a linear array of size Nrows*Ncols, and then index the elements as i*Ncols + j, where i and j are the row and column indices, respectively. If you want column-major storage instead, address the elements by i + Nrows*j. You can wrap this index juggling in inline functions that have virtually zero overhead.
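For example, such a wrapper could be as simple as this trivial sketch:

#include <vector>

// Row-major indexing into a matrix stored as one contiguous vector.
inline double& at(std::vector<double>& m, int i, int j, int Ncols) {
    return m[i * Ncols + j];  // column-major variant: m[i + Nrows * j]
}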
I would suggest to lay out the data differently:
Each process has a map of its objects and their positions in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate, and communicate it with MPI_Allgather (each process sends its data to all other processes; at the end everyone has all the data). If you need fast lookup by coordinates, you can build up a cache.
That may or may not perform well. Other optimizations (like sharing 'transactions') depend entirely on your objects and the operations you perform on them.
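A hedged sketch of that all-gather step (the Move struct, its packing as three raw ints, and the helper name are my inventions; a custom MPI datatype would be cleaner):

#include <mpi.h>
#include <vector>

// One relocation: which object moves, and its new matrix coordinates.
struct Move { int objectId, newRow, newCol; };

std::vector<Move> exchangeMoves(const std::vector<Move>& myMoves) {
    int nProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

    // 1. Everyone learns how many ints each rank will publish.
    int myCount = static_cast<int>(myMoves.size()) * 3;
    std::vector<int> counts(nProcs);
    MPI_Allgather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT,
                  MPI_COMM_WORLD);

    // 2. Gather the variable-length move lists from all ranks.
    std::vector<int> displs(nProcs, 0);
    for (int r = 1; r < nProcs; ++r) displs[r] = displs[r - 1] + counts[r - 1];
    const int total = displs[nProcs - 1] + counts[nProcs - 1];

    std::vector<Move> all(total / 3);
    MPI_Allgatherv(myMoves.data(), myCount, MPI_INT,
                   all.data(), counts.data(), displs.data(), MPI_INT,
                   MPI_COMM_WORLD);
    return all;  // every rank now applies every move to its local matrix
}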