I am working on digital sampling for sensor. I have following code to compute the highest amplitude and the corresponding time.
struct LidarPoints{
float timeStamp;
float Power;
}
std::vector<LidarPoints> measurement; // To store Lidar points of current measurement
Currently power and energy are the same (because of delta function)and vector is arranged in ascending order of time. I would like to change this to step function. Pulse duration is a constant 10ns.
uint32_t pulseDuration = 5;
The problem is to find any overlap between the samples and if any to add up the amplitudes.
I currently use following code:
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
}
}
}
Is it possible to code this without two for loops since I cannot afford the amount of time being taken by nested loops.
You can take advantage that the vector is sorted by timeStamp and find the next pulse with binary search, thus reducing the complexity from O(n^2) to O(n log n):
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator
auto it = measurement.begin();
auto end = measurement.end();
while (it != end)
{
// next timestamp as in your code
auto timeStampLower = it->timeStamp + pulseDuration;
// next value in measurement with a timestamp >= timeStampLower
auto lower_bound = std::lower_bound(it, end, timeStampLower, [](float a, const LidarPoints& b) {
return a < b.timeStamp;
});
// sum over [timeStamp, timeStampLower)
float sum = std::accumulate(it, lower_bound, 0.0f, [] (float a, const LidarPoints& b) {
return a + b.timeStamp;
});
auto num = std::distance(it, lower_bound);
// num should be >= since the vector is sorted and pulseDuration is positive
// you should uncomment next line to catch unexpected error
// Expects(num >= 1); // needs GSL library
// assert(num >= 1); // or standard C if you don't want to use GSL
// average over [timeStamp, timeStampLower)
it->timeStamp = sum / num;
// advance it
it = lower_bound;
}
https://en.cppreference.com/w/cpp/algorithm/lower_bound
https://en.cppreference.com/w/cpp/algorithm/accumulate
Also please note that my algorithm will produce different result than yours because you don't really compute the average over multiple values with measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f
Also to consider: (I am by far not an expert in the field, so I am just throwing the ideea, it's up to you to know if its valid or not): with your code you just "squash" together close measurement, instead of having a vector of measurement with periodic time. It might be what you intend or not.
Disclaimer: not tested beyond "it compiles". Please don't just copy-paste it. It could be incomplet and incorrekt. But I hope I gave you a direction to investigate.
Due to jitter and other timing complexities, instead of simple summation, you need to switch to [Numerical Integration][۱] (eg. Trapezoidal Integration...).
If your values are in ascending order of timeStamp adding else break to the if statement shouldn't effect the result but should be a lot quicker.
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
} else {
break;
}
}
}
I have a vector of structures that stores a stream of values that arrive at different intervals. The structure consists of two elements , one for the value and the other records the time when the value arrived.
struct Data {
Time timeOfArrival;
Time value;
}cd;
Lets say that in another thread I want to calculate the moving average of the values that arrived in the last 10 (say) seconds. So in one thread the vector is being populated and in another I want to calculate the moving average.
Here is how I would do it. For simplicity's sake I redefined Data as follows:
struct Data
{
int timeOfArrival;
int value;
};
One way to do what you asked is to use a circular buffer where you store just the amount of data that you need for your moving average.
enum { MOVING_AVG_SIZE = 64, }; // number of elements that you use for your moving average
std::vector<Data> buffer(MOVING_AVG_SIZE);
std::vector<Data>::iterator insertIt = buffer.begin();
// saving to circular buffer
Data newData;
++insertIt
if (insertIt == buffer.end()) insertIt = buffer.begin();
*insertIt = newData;
// average
int sum = 0;
for (std::vector<Data>::const_iterator it = buffer.begin(); it != buffer.end(); ++it)
{
sum += it->value;
}
float avg = sum / (float)buffer.size();
If you don't have a circular buffer and you just keep adding values to your vector, then you can just get the last number of elements needed for calculating your moving average.
// saving to circular buffer
Data newData;
buffer.push_back(newData);
// average
// this algorithm calculates the moving average even if there is not enough samples in the buffer for the "10 s"
std::vector<Data>::const_reverse_iterator it = buffer.rbegin();
int i;
int sum = 0;
for (i = 0; i < MOVING_AVG_SIZE || it == buffer.rend(); ++i)
{
sum += it->value;
}
float avg = sum / (float)i;
As you have already decided that one thread will always populate and another thread will do moving average. You can then do this:
Keep a structure with two elements RunningSum and no of items in vector.
Write a loop which removes elements which are older than 10 sec and deduct their values from RunningSum. All elemenst in vector are sorted on timeofArrival, so you need not iterate the whole vector.
Add to sum the value of new elemnts which are not yet added.
You need a way to deffrentiate items that have been added( used in sum) and the one that are not yet summed. You can use a boolean for this or put them in a new dataStructure(internal to your class).
Keep count of number of elements and calculate the average.
According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a c-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//It's bracket operator is overloaded to return a const reference to the appropriate array element
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
int cr = inputSize; //'cache' read position
int cw = 0; //'cache' write position
int di = 0; //index within 'data' vector (ex. data[di])
int bi = 0; //index within 'data' element (ex. data[di][bi])
int blockSize = irBlockSize();
int dataSize = data.size();
int cacheSize = cache.size();
//Basically, this takes the existing values in 'cache', sums them with the
//values in 'data' at the appropriate positions, and stores them back in
//the cache at a new position.
while (cw < cacheSize)
{
int n = 0;
if (di < dataSize)
n = data[di][bi];
if (di > 0 && bi < inputSize)
n += data[di - 1][blockSize + bi];
if (++bi == blockSize)
{
di++;
bi = 0;
}
if (cr < cacheSize)
n += cache[cr++];
cache[cw++] = n;
}
//Take the first 'inputSize' number of values and return them to a new vector.
return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function which multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about 1/3 the processor power.
Perhaps it has something to with the fact it has to jump around within the vectors rather then just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but this requires more total iterations and actually the performance was worse that way.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx
I am a newbie in OpenCL. However, I understand the C/C++ basics and the OOP.
My question is as follows: is it somehow possible to run the sum computation task in parallel? Is it theoretically possible? Below I will describe what I've tried to do:
The task is, for example:
double* values = new double[1000]; //let's pretend it has some random values inside
double sum = 0.0;
for(int i = 0; i < 1000; i++) {
sum += values[i];
}
What I tried to do in OpenCL kernel (and I feel it is wrong because perhaps it accesses the same "sum" variable from different threads/tasks at the same time):
__kernel void calculate2dim(__global float* vectors1dim,
__global float output,
const unsigned int count) {
int i = get_global_id(0);
output += vectors1dim[i];
}
This code is wrong. I will highly appreciate if anyone answers me if it is theoretically possible to run such tasks in parallel and if it is - how!
If you want to sum the values of your array in a parallel fashion, you should make sure you reduce contention and make sure there's no data dependencies across threads.
Data dependencies will cause threads to have to wait for each other, creating contention, which is what you want to avoid to get true parallellization.
One way you could do that is to split your array into N arrays, each containing some subsection of your original array, and then calling your OpenCL kernel function with each different array.
At the end, when all kernels have done the hard work, you can just sum up the results of each array into one. This operation can easily be done by the CPU.
The key is to not have any dependencies between the calculations done in each kernel, so you have to split your data and processing accordingly.
I don't know if your data has any actual dependencies from your question, but that is for you to figure out.
The piece of code I've provided for reference should do the job.
E.g. you have N elements, and size of your workgroup is WS = 64. I assume that N is multiple of 2*WS (this is important, one workgroup calculates sum of 2*WS elements). Then you need to run kernel specifying:
globalSizeX = 2*WS*(N/(2*WS));
As a result sum array will have partial sums of 2*WS elements. ( e.g. sum[1] - will contain sum of elements whose indices are from 2*WS to 4*WS-1).
If your globalSizeX is 2*WS or less (which means that you have only one workgroup), then you are done. Just use sum[0] as a result.
If not - you need to repeat procedure, this time using sum array as input array and output to other array (create 2 arrays and ping-pong between them). And so on untill you will have only one workgroup.
Search also for Hilli Steele / Blelloch parallel algorithms.
This article could be useful as well
Here is the actual example:
__kernel void par_sum(__global unsigned int* input, __global unsigned int* sum)
{
int li = get_local_id(0);
int groupId = get_group_id(0);
__local int our_h[2 * get_group_size(0)];
our_h[2*li + 0] = hist[2*get_group_size(0)*blockId + 2*li + 0];
our_h[2*li + 1] = hist[2*get_group_size(0)*blockId + 2*li + 1];
// sweep up
int width = 2;
int num_el = 2*get_group_size(0)/width;
int wby2 = width>>1;
for(int i = 2*BLK_SIZ>>1; i>0; i>>=1)
{
barrier(CLK_LOCL_MEM_FENCE);
if(li < num_el)
{
int idx = width*(li+1) - 1;
our_h[idx] = our_h[idx] + our_h[(idx - wby2)];
}
width<<=1;
wby2 = width>>1;
num_el>>=1;
}
barrier(CLK_LOCL_MEM_FENCE);
// down-sweep
if(0 == li)
sum[groupId] = our_h[2*get_group_size(0)-1]; // save sum
}
So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. to read all the pixel values of the geotiff one pixel at a time takes around 30 seconds. (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), to calculate the statistics of the entire matrix it takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
stats_standard_dev = mean2 / (n - 1);
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results to size() but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
First thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't supposed to improve performance. It also applies to vals.size(), but of course inner loop is worse. If vals is a vector of vector, you better use iterators or at least keep reference for the outer vector (because get() with indices parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help
Second, make your row < vals.size into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size.
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less that a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch predicition. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't all the information you want if you try and step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running in 230 times slower that that. Are you on a 10 MHz processor?
Though there are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or a running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard
deviation) the only thing required is to compute the sum
of value and the sum of squared value. From these
two sums the mean and standard deviation can be computed
after the outer loop (together with a third value, the
number of samples - n is your new/updated code). The
equations can be derived from the definitions or found
on the web, e.g. Wikipedia. For instance the mean is
just sum of value divided by n. For the n version (in
contrast to the n-1 version - however n is large in
this case so it doesn't matter which one is used) the
standard deviation is: sqrt( n * sumOfSquaredValue -
sumOfValue * sumOfValue). Thus only two floating point
additions and one multiplication are needed in the
inner loop. Overflow is not a problem with these sums as
the range for doubles is 10^318. In particular you will
get rid of the expensive floating point division that
the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are
rewritten every time (the compiler may or may not
prevent this). As the minimum quickly becomes small and
the maximum quickly becomes large, only the two comparisons
should happen for the majority of loop iterations: use
if statements instead to be sure. It can be argued, but
on the other hand it is trivial to do.
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outter vector, just the inner one.
If you container type is a list, this will be significantly faster. Because the look up time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm nor sure of what type vals is but vals.at(row).size() could take a long time if itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²)
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
int accum_ichi += value;
int accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}