I encountered a problem when I tried to calculate the mean of an array in two ways. Below is the code:
float sum1, sum2, tmp, mean1, mean2;
double sum1_double, sum2_double, tmp_double, mean1_double, mean2_double;
int i, j;
int Nt=29040000; //array size
int piecesize=32;
int Npiece=Nt/piecesize;
float* img;
float* d_img;
double* img_double;
img_double = (double*)calloc(Nt, sizeof(double));
cudaHostAlloc((void**)&img, sizeof(float)*Nt, cudaHostAllocDefault);
cudaMalloc((void**)&d_img, sizeof(float)*Nt);
...
//Some calculation is done in GPU and the results are stored in d_img;
...
cudaMemcpy(img, d_img, Nt*sizeof(float), cudaMemcpyDeviceToHost);
for (i=0;i<Nt;i++) img_double[i]=(double)img[i];
//Method 1
sum1=0;
for (i=0;i<Nt;i++)
{ sum1 += img[i]; }
sum1_double=0;
for (i=0;i<Nt;i++)
{ sum1_double += img_double[i]; }
//Method 2
sum2=0;
for (i=0;i<Npiece;i++)
{ tmp=0;
for (j=0;j<piecesize;j++)
{ tmp += img[i*piecesize+j];}
sum2 += tmp;
}
sum2_double=0;
for (i=0;i<Npiece;i++)
{ tmp_double=0;
for (j=0;j<piecesize;j++)
{ tmp_double += img_double[i*piecesize+j];}
sum2_double += tmp_double;
}
mean1=sum1/(float)Nt;
mean2=sum2/(float)Nt;
mean1_double=sum1_double/(double)Nt;
mean2_double=sum2_double/(double)Nt;
cout<<setprecision(15)<<mean1<<endl;
cout<<setprecision(15)<<mean2<<endl;
cout<<setprecision(15)<<mean1_double<<endl;
cout<<setprecision(15)<<mean2_double<<endl;
Output:
132.221862792969
129.565872192383
129.565938340543
129.565938340543
The results obtained from the two methods, mean1=132.2 and mean2=129.6, are significantly different. May I know why?
Thanks a lot in advance!
The reason is that floating-point arithmetic is not exact. When you accumulate integer values, a float becomes imprecise once abs(value) grows larger than 2^24 (assuming IEEE-754 32-bit here). For example, a float cannot store 16777217 exactly (it becomes 16777216 or 16777218, depending on the rounding mode).
Your second calculation is presumably the more precise one: less precision is lost because of the separate tmp accumulation, since each partial sum stays small and fewer low-order bits are discarded.
Change your sum1, sum2, tmp variables to long long int, and hopefully you'll get the same result for both calculations.
Note: I've assumed that your img stores integer data. If it stores floats, there is no easy way to fix this perfectly. One option is to use double instead of float for sum1, sum2 and tmp; the difference will still be there, but it will be much smaller. There are also techniques for accumulating floats more precisely than simple summation, such as Kahan summation.
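For illustration, here is a minimal sketch of Kahan (compensated) summation for a float array; the function name and signature are mine, not from the question:
#include <cstddef>
// Compensated (Kahan) summation: the rounding error of each addition is
// carried along in a separate variable instead of being thrown away.
float kahan_sum(const float* data, std::size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;               // running compensation for lost low-order bits
    for (std::size_t i = 0; i < n; i++) {
        float y = data[i] - c;    // apply the correction from the previous step
        float t = sum + y;        // low-order bits of y may be lost here...
        c = (t - sum) - y;        // ...but are recovered algebraically into c
        sum = t;
    }
    return sum;
}
Applied to img above, this stays much closer to the double accumulation, at the cost of a few extra operations per element.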
For signal processing I need to process relatively large C arrays as shown in the code below. This works fine so far; unfortunately, the implementation is slow. The size of "calibdata" is around 150k, and the calculation needs to be repeated for different frequencies/phases. Is there a way to improve the speed significantly? Doing the same with logical indexing in MATLAB is much faster.
What I tried already:
using a Taylor approximation of sine: no significant improvement.
using std::vector: also no significant improvement.
code:
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result = 0;
for (int i = 0; i < size; i++)
result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));
result = fabs(result / size);
return result;}
Best regards,
Thomas
When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.
Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.
For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
double result = 0;
double angle = phase * (PI / 180) - (PI / 2);
double delta = 2 * PI * freqscale[currentcarrier] / fs;
for (int i = 0; i < size; i++)
{
result += calibdata[i] * cos( angle );
angle += delta;
}
return fabs(result / size);
}
Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com
Outside of MATLAB you could use OpenCL or CUDA yourself.
Your enemies in execution time are:
Division
Function calls (including implicit ones in loops)
Accessing data from different areas
Operating dissimilar instructions
You should research data-driven programming and how to use the data cache effectively.
Division
Whether it has hardware support or not, division takes a long time by its very nature. Eliminate it where possible by changing the numeric base or by factoring it out of the loop, as in the sketch below.
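As a sketch of factoring the division out (the function and variable names here are illustrative, not from the question), the per-sample division by fs can be replaced by a single division before the loop:
#include <cmath>
double phase_func_no_div(const double* calibdata, long size,
                         const double* freqscale, double fs,
                         double phase, int currentcarrier)
{
    const double PI = 3.141592653589793;
    double offset = phase * (PI / 180.0) - (PI / 2.0);
    double scale = 2.0 * PI * freqscale[currentcarrier] / fs;  // one division, outside the loop
    double result = 0.0;
    for (long i = 0; i < size; i++)
        result += calibdata[i] * std::cos(scale * (double)i + offset);
    return std::fabs(result / size);
}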
Function Calls
The most efficient method of execution is sequential; processors are optimized for this. A branch may require the processor to perform additional work (branch prediction) or to reload the instruction cache/pipeline. That is time wasted that could be spent executing data instructions.
The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.
Accessing data from different areas
Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data. Specifically, loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.
A better solution is to place all data contiguously or to contiguously access all the data. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses between).
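As a sketch of this idea (the two-pass split, the buffer, and the names are my own illustration; whether it wins depends on memory bandwidth), the cosine values can be produced in one sequential pass and consumed in a second sequential pass:
#include <cmath>
#include <vector>
double phase_func_two_pass(const double* calibdata, long size,
                           double angle0, double delta)
{
    std::vector<double> cos_buf(size);
    for (long i = 0; i < size; i++)               // pass 1: sequential writes only
        cos_buf[i] = std::cos(angle0 + delta * i);
    double result = 0.0;
    for (long i = 0; i < size; i++)               // pass 2: sequential reads of both arrays
        result += calibdata[i] * cos_buf[i];
    return std::fabs(result / size);
}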
Dissimilar Instructions
Modern processors are more efficient at processing a batch of similar instructions. For example the pattern load, add, store is more efficient for blocks when all the loading is performed, then all adding, then all storing.
Summary
Here's an example:
register double result = 0.0;
register double angle = phase * (PI / 180) - (PI / 2);           /* starting angle, from the question's formula */
register double delta = 2 * PI * freqscale[currentcarrier] / fs; /* per-sample increment */
register unsigned int i = 0U;
for (i = 0; i < size; i += 2)
{
register double cos_angle1 = cos(angle);
register double cos_angle2 = cos(angle + delta);
result += calibdata[i + 0] * cos_angle1;
result += calibdata[i + 1] * cos_angle2;
angle += 2 * delta;
}
The above loop is unrolled and like operations are performed in groups.
Although the keyword register may be deprecated, it is a suggestion to the compiler to use dedicated registers (if possible).
You can try to use the definition of the cosine in terms of the complex exponential:
cos(x) = ( exp(j*x) + exp(-j*x) ) / 2, where j^2 = -1.
Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, and sin(), cos() and exp() are only called a couple of times.
Here goes the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>
#define PI 3.141592653589
typedef struct cos_plan{
double complex* expo;
int size;
}cos_plan;
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result=0; //initialization
for (int i = 0; i < size; i++){
result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );
//printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
}
result = fabs(result / size);
return result;
}
double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){
//first, let's compute the exponentials:
//double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
//double complex phaseexpm=conj(phaseexp);
double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
double phasecos=cos(phase*(PI / 180.) - (PI / 2.));
if (plan->size<size){
double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
plan->expo=tmp;
plan->size=size;
}
plan->expo[0]=1;
//plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
//printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
for(int i=2;i<size;i++){
if(i%2==0){
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
}else{
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
}
}
//computing the result
double result=0; //initialization
for(int i=0;i<size;i++){
//double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
double coss=creall(plan->expo[i])*phasecos-cimagl(plan->expo[i])*phasesin;
//printf("i %d cos %g\n",i,coss);
result+=calibdata[i] *coss;
}
result = fabs(result / size);
return result;
}
int main(){
//the parameters
long n=100000000;
double* calibdata=malloc(n*sizeof(double));
if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int freqnb=42;
double* freqscale=malloc(freqnb*sizeof(double));
if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
for (int i = 0; i < freqnb; i++){
freqscale[i]=i*i*0.007+i;
}
double fs=n;
double phase=0.05;
//populate calibdata
for (int i = 0; i < n; i++){
calibdata[i]=i/((double)n);
calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
}
//call to sample code
clock_t t;
t = clock();
double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
t = clock() - t;
printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//initialize
cos_plan plan;
plan.expo=malloc(n*sizeof(double complex));
plan.size=n;
t = clock();
res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
t = clock() - t;
printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//cleaning
free(plan.expo);
free(calibdata);
free(freqscale);
return 0;
}
Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the proposed solution takes 1.5 seconds... It is not spectacular, but it is not negligible either.
The presented solution does not involve any call to cos or sin in the for loops; there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the former, since I had to use an additional array of complex values).
For complex numbers in C, see:
How to work with complex numbers in C?
Computing e^(-j) in C
If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of a Discrete Cosine Transform; the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.
Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[i] when size is large. Using partial sums can resolve the issue, as in the sketch below.
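A minimal sketch of such partial (blocked) summation; the block size of 1024 is an arbitrary illustrative choice:
double blocked_sum(const double* terms, long size)
{
    const long block = 1024;
    double total = 0.0;
    for (long start = 0; start < size; start += block) {
        long end = (start + block < size) ? (start + block) : size;
        double partial = 0.0;            // each partial sum stays small
        for (long i = start; i < end; i++)
            partial += terms[i];
        total += partial;                // fewer low-order bits are lost overall
    }
    return total;
}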
A simple trig identity eliminates the - (PI / 2). This is also more accurate than performing the subtraction, which uses the machine value of π; that matters when values are near π/2.
cos(x - π/2) == sin(x)
Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also #user3528438)
// double phase_func(double* calibdata, long size,
// double* freqscale, double fs, double phase, int currentcarrier) {
double phase_func(const double* restrict calibdata, long size,
const double* restrict freqscale, double fs, double phase, int currentcarrier) {
Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.
// result += calibdata[i] * cos(...
result += calibdata[i] * cosf(...
Minimize recalculations.
double angle_delta = ...;
double angle_current = ...;
for (int i = 0; i < size; i++) {
result += calibdata[i] * cos(angle_current);
angle_current += angle_delta;
}
It is unclear why the code uses long size and int currentcarrier. I'd expect both to use the same type, ideally size_t, which is idiomatic for array indexing. #Daniel Jour
Reversing loops can allow a compare against 0 rather than against a variable; sometimes a modest performance gain.
Ensure compiler optimizations are enabled.
All together
double phase_func2(const double* restrict calibdata, size_t size,
const double* restrict freqscale, double fs, double phase,
size_t currentcarrier) {
double result = 0.0;
double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
size_t i = size;
while (i) {
result += calibdata[--i] * sinf(angle_current);  // cos(x - pi/2) == sin(x)
angle_current -= angle_delta;
}
result = fabs(result / size);
return result;
}
To leverage the cores you have without resorting to the GPU, use OpenMP. Testing with VS2015, the loop invariants are hoisted out of the loop by the optimizer; AVX2 and OpenMP were enabled.
double phase_func3(double* calibdata, const int size, const double* freqscale,
const double fs, const double phase, const size_t currentcarrier)
{
double result{};
constexpr double PI = 3.141592653589;
#pragma omp parallel
#pragma omp for reduction(+: result)
for (int i = 0; i < size; ++i) {
result += calibdata[i] *
cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
}
result = fabs(result / size);
return result;
}
The original version with AVX enabled took: ~1.4 seconds
and adding OpenMP brought it down to: ~0.51 seconds.
Pretty nice return for two pragmas and a compiler switch.
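For reference, the exact switches depend on the toolchain: in Visual Studio, OpenMP and AVX2 are enabled with /openmp and /arch:AVX2; with g++ the equivalent would be something like the line below (the file name is illustrative).
g++ -O3 -fopenmp -mavx2 phase.cpp -o phase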
I've been working on implementing black-body radiation according to Planck's law with the following:
double BlackBody(double T, double wavelength) {
wavelength /= 1e9; // pre-scale wavelength to meters
static const double h = 6.62606957e-34; // Planck constant
static const double c = 299792458.0; // speed of light in vacuum
static const double k = 1.3806488e-23; // Boltzmann constant
double exparg = h*c / (k*wavelength*T);
double exppart = std::exp(exparg) - 1.0;
double constpart = (2.0*h*c*c);
double powpart = pow(wavelength, -5.0);
double v = constpart * powpart / exppart;
return v;
}
I have a float[max-min+1] array, where static const int max = 780 and static const int min = 380. I simply iterate over the array and store what BlackBody returns for each wavelength (wavelength = array index + min). IntensitySpectrum::BlackBody performs this iteration; both min and max are static member variables, and the array is also a member of IntensitySpectrum.
IntensitySpectrum spectrum;
Vec3 rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
for (int xc = 0; xc < grapher.GetWidth(); xc++) {
if (xc % 10 == 0) {
spectrum.BlackBody(200.f + xc * 200.f);
spectrum.Scale(1.0f / 1e+14f);
rgb = spectrum.ToRGB();
rgb /= std::max(rgb.x, std::max(rgb.y, rgb.z));
}
for (int yc = 20; yc < 40; yc++) {
grapher(xc, yc) = grapher.FloatToUint(rgb.x, rgb.y, rgb.z);
}
}
The problem is that the line spectrum.BlackBody() sets the 0th element of the array to NaN, and only the 0th. It also does not happen on the very first iteration, only on the following ones where xc >= 10.
The text from the VS debugger:
spectrum = {intensity=0x009bec50 {-1.#IND0000, 520718784., 537559104., 554832896., 572547904., 590712128., 609333504., ...} }
I tracked the error down: exppart in the ::BlackBody() function becomes NaN; basically exp() returns NaN even though its argument is near 2.0, so it is definitely not overflow. But only for array index 0. It magically works for the remaining 400 indices.
I know memory overruns might cause things like that. That's why I double checked my memory handling.
I'm linking Vec3 from another self-made library, which is much bigger, and might contain errors, but what I use from Vec3 has nothing to do with memory.
After many hours I'm completely clueless. What else can cause this? Is the optimizer or WINAPI fooling me...? (Uhm, yes, the program creates a window, with WINAPI, and uses a nearly empty WndProc that calls my code on WM_PAINT.)
Thanks for your help in advance.
Sorry for making it unclear. This is the layout:
// member
class IntensitySpectrum {
public:
void BlackBody(float temperature) {
// ...
this->intensity[i] = ::BlackBody(temperature, wavelength(i));
// ...
}
private:
static const int min = 380;
static const int max = 780;
float intensity[max-min+1];
};
// global
double BlackBody(double T, double wavelength);
If you happen to be using MSVC 2013, one possible explanation is that you have some code somewhere that is trying to convert a float infinity to int. A bug in MSVC 2013 causes an unbalanced push on the x87 FPU stack when this happens. Trigger that bug 8 times and your FPU stack is totally full, and any subsequent attempt to push a value (such as calling 'exp()') will result in an 'invalid operation' and return an indefinite (like 1.#IND). Note that even if you are compiling with SSE2 floating point instructions, this bug can still bite because the calling convention dictates that floating point return values are returned on the top of the FPU stack.
To check if this is your issue, have a look at your FPU registers just prior to the bad call to 'exp()'. If your TAGS register is all zero, then your FPU stack is full.
http://connect.microsoft.com/VisualStudio/feedback/details/806362/vc12-pollutes-the-floating-point-stack-when-casting-infinity-nan-to-unsigned-long
MS claims this will be fixed in update 2 for MSVC 2013.
The following function call only has 1 parameter:
spectrum.BlackBody(200.f + xc * 200.f);
So it cannot be calling the function you defined as
double BlackBody(double T, double wavelength)
If you look at the ::BlackBody implementation, I'm betting you have a divide by 0 error somewhere.
I am facing a problem using float: in a loop, its value gets stuck at 8388608.00.
int count=0;
long X=10;
cout.precision(flt::digits10);
cout<<"Iterration #"<<setw(15)<<"Add"<<setw(21)<<"Mult"<<endl;
float Start=0.0;
float Multiplication = Addition * N;
long i = 1;
for (i; i <= N; i++){
float temp = Start + Addition;
Start=temp;
count++;
if(count%X==0 && count!=0)
{
X*=10;
cout<<i;
cout<<fixed<<setw(30)<<Start<<setw(20)<<fixed<<i*Addition<<endl;
}
}
What should I do?
Floating-point addition loses precision when you add a (relatively) small number to a (relatively) big one. It's caused by the way float is stored in memory: the significand holds only 24 bits, so once the running sum gets large enough, the small addend is rounded away entirely.
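A minimal demonstration of the effect (the constants are chosen to show the thresholds; 8388608 is 2^23, the value where your sum stops):
#include <iostream>
int main()
{
    float big = 16777216.0f;                       // 2^24
    std::cout << (big + 1.0f == big) << "\n";      // prints 1: adding 1.0f no longer changes the value
    float sum = 8388608.0f;                        // 2^23: neighbouring floats here are 1.0 apart
    sum += 0.5f;                                   // exactly half the spacing: the tie rounds to even
    std::cout << (sum == 8388608.0f) << "\n";      // prints 1: the 0.5f was rounded away
    return 0;
}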
You may try replacing single-precision floating point (float) with double-precision floating point (double), but if that doesn't work you'll probably need to implement a hack like this:
// Lets say
double OriginalAddition = 0.123;
int Addition = 1;
// You just use base math substitution:
// Addition = OriginalAddition
int temp = Start + Addition; // You transform floating point to fixed point
// with step 0.123, so 1 = 0.123
// And when displaying result (transform back into original floating point):
printf( "%f", (double)result*OriginalAddition)
This needs a lot of thought to find a substitution that doesn't cause data loss, covers the required precision and won't overflow the int. Try searching for fixed-point arithmetic in C to get a better idea of what to do.
Is there a way to convert a std::bitset<64> to a double without using any external library (Boost, etc.)? I am using a bitset to represent a genome in a genetic algorithm and I need a way to convert a set of bits to a double.
The C++11 road:
union Converter { uint64_t i; double d; };
double convert(std::bitset<64> const& bs) {
Converter c;
c.i = bs.to_ullong();
return c.d;
}
EDIT: As noted in the comments, we can use char* aliasing as it is unspecified instead of being undefined.
double convert(std::bitset<64> const& bs) {
static_assert(sizeof(uint64_t) == sizeof(double), "Cannot use this!");
uint64_t const u = bs.to_ullong();
double d;
// Aliases to `char*` are explicitly allowed in the Standard (and only them)
char const* cu = reinterpret_cast<char const*>(&u);
char* cd = reinterpret_cast<char*>(&d);
// Copy the bitwise representation from u to d
memcpy(cd, cu, sizeof(u));
return d;
}
C++11 is still required for to_ullong.
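As a side note beyond what this answer targeted: if C++20 is available, std::bit_cast expresses the same reinterpretation directly and is well-defined.
#include <bit>
#include <bitset>
#include <cstdint>
double convert_cpp20(std::bitset<64> const& bs) {
    // Reinterprets the 64 bits as an IEEE-754 double (C++20).
    return std::bit_cast<double>(static_cast<std::uint64_t>(bs.to_ullong()));
}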
Most people are trying to provide answers that let you treat the bit-vector as though it directly contained an encoded int or double.
I would advise you to completely avoid that approach. While it does "work" for some definition of working, it introduces Hamming cliffs all over the place. You usually want your encoding arranged so that if two decoded values are near one another, their encoded values are near one another as well. It also forces you to use 64 bits of precision.
I would manage the conversion manually. Say you have three variables to encode, x, y, and z. Your domain expertise can be used to say, for example, that -5 <= x < 5, 0 <= y < 100, and 0 <= z < 1, where you need 8 bits of precision for x, 12 bits for y, and 10 bits for z. This gives you a total search space of only 30 bits. You can have a 30-bit string and treat the first 8 bits as encoding x, the next 12 as y, and the last 10 as z. You are also free to Gray-code each one to remove the Hamming cliffs.
I've personally done the following in the past:
inline void binary_encoding::encode(const vector<double>& params)
{
unsigned int start=0;
for(unsigned int param=0; param<params.size(); ++param) {
// m_bpp[i] = number of bits in encoding of parameter i
unsigned int num_bits = m_bpp[param];
// map the double onto the appropriate integer range
// m_range[i] is a pair of (min, max) values for ith parameter
pair<double,double> prange=m_range[param];
double range=prange.second-prange.first;
double max_bit_val=pow(2.0,static_cast<double>(num_bits))-1;
int int_val=static_cast<int>((params[param]-prange.first)*max_bit_val/range+0.5);
// convert the integer to binary
vector<int> result(m_bpp[param]);
for(unsigned int b=0; b<num_bits; ++b) {
result[b]=int_val%2;
int_val/=2;
}
if(m_gray) {
for(unsigned int b=0; b<num_bits-1; ++b) {
result[b]=!(result[b]==result[b+1]);
}
}
// insert the bits into the correct spot in the encoding
copy(result.begin(),result.end(),m_genotype.begin()+start);
start+=num_bits;
}
}
inline void binary_encoding::decode()
{
unsigned int start = 0;
// for each parameter
for(unsigned int param=0; param<m_bpp.size(); param++) {
unsigned int num_bits = m_bpp[param];
unsigned int intval = 0;
if(m_gray) {
// convert from gray to binary
vector<int> binary(num_bits);
binary[num_bits-1] = m_genotype[start+num_bits-1];
intval = binary[num_bits-1];
for(int i=num_bits-2; i>=0; i--) {
binary[i] = !(binary[i+1] == m_genotype[start+i]);
intval += intval + binary[i];
}
}
else {
// convert from binary encoding to integer
for(int i=num_bits-1; i>=0; i--) {
intval += intval + m_genotype[start+i];
}
}
// convert from integer to double in the appropriate range
pair<double,double> prange = m_range[param];
double range = prange.second - prange.first;
double m = range / (pow(2.0,double(num_bits)) - 1.0);
// m_phenotype is a vector<double> containing all the decoded parameters
m_phenotype[param] = m * double(intval) + prange.first;
start += num_bits;
}
}
Note that for reasons that probably don't matter to you, I wasn't using bit vectors -- just an ordinary vector<int> to encode things. And of course, there's a bunch of stuff tied into this code that isn't shown here, but you can probably get the basic idea.
One other note, if you're doing GPU calculations or if you have a particular problem such that 64 bits are the appropriate size anyway, it may be worth the extra overhead to stuff everything into native words. Otherwise, I would guess that the overhead you add to the search process will probably overwhelm whatever benefits you get by faster encoding and decoding.
Edit: I've decided that I was being a bit silly with this. While you do end up with a double, it assumes that the bitset holds an integer... which is a big assumption to make. You will end up with a predictable and repeatable value per bitset, but I still don't think this is what the author intended.
Well, if you iterate over the bit values and do
output_double += pow( 2, 64-(bit_position+1) ) * bit_value;
that would work, as long as bit 0 is treated as the most significant bit.
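A sketch of that loop (the function name is mine; bit 0 of the bitset is taken as the most significant bit, as the formula implies):
#include <bitset>
#include <cmath>
double bits_to_value(const std::bitset<64>& bs)
{
    double output_double = 0.0;
    for (int bit_position = 0; bit_position < 64; ++bit_position) {
        int bit_value = bs[bit_position] ? 1 : 0;
        // bit 0 contributes 2^63, bit 63 contributes 2^0
        output_double += std::pow(2.0, 64 - (bit_position + 1)) * bit_value;
    }
    return output_double;
}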
I wrote this simple code which reads a distance from a Sharp infrared sensor and prints the average distance in cm over serial.
When I run this code on an Arduino Mega board, the LED on pin 13 starts blinking and the program does nothing. Where is the bug in this code?
#include <QueueList.h>
const int ANALOG_SHARP = 0; //Set pin data from sharp.
QueueList <float> queuea;
float cm;
float qu1;
float qu2;
float qu3;
float qu4;
float qu5;
void setup() {
Serial.begin(9600);
}
void loop() {
cm = read_gp2d12_range(ANALOG_SHARP); //Convert to cm (unit).
queuea.push(cm); //Add item to queue, when I add only this line Arduino crash.
if ( 5 <= queuea.peek()) {
Serial.println(average());
}
}
float read_gp2d12_range(byte pin) { //Function converting to cm (unit).
int tmp;
tmp = analogRead(pin);
if (tmp < 3)
return -1; // Invalid value.
return (6787.0 /((float)tmp - 3.0)) - 4.0;
}
float average() { //Calculate average length
qu1 += queuea.pop();
qu2 += queuea.pop();
qu3 += queuea.pop();
qu4 += queuea.pop();
qu5 += queuea.pop();
float aver = ((qu1+qu2+qu3+qu4+qu5)/5);
return aver;
}
I agree with the peek() -> count() error listed by vhallac. But I'll also point out that you should consider averaging by powers of 2 unless there is a strong case to do otherwise.
The reason is that on microcontrollers, division is slow. By averaging over a power of 2 (2,4,8,16,etc.) you can simply calculate the sum and then bitshift it.
To calculate the average of 2: (v1 + v2) >> 1
To calculate the average of 4: (v1 + v2 + v3 + v4) >> 2
To calculate the average of n values (where n is a power of 2), just shift the sum right by log2(n).
As long as the datatype for your sum variable is big enough and won't overflow, this is much easier and much faster.
Note: this won't work for floats in general. In fact, microcontrollers aren't optimized for floats. You should consider converting from int (which is what I'm assuming your ADC is reading) to float at the end, after the averaging, rather than before.
By converting from int to float and then averaging floats, you lose more precision than by averaging ints and converting the result to a float afterwards.
Other:
You're using the += operator on variables (qu1, qu2, etc.) that are never reset, so they keep accumulating across calls; plain = would work fine here.
For floats, I'd have written the average function as:
float average(QueueList<float> & q, int n)
{
float sum = 0;
for(int i=0; i<n; i++)
{
sum += q.pop();
}
return (sum / (float) n);
}
And called it: average(queuea, 5);
You could use this to average any number of sensor readings, and later use the same code to average floats in a completely different QueueList. Passing the number of readings to average as a parameter will really come in handy if you need to tweak it.
TL;DR:
Here's how I would have done it:
#include <QueueList.h>
const int ANALOG_SHARP=0; // set pin data from sharp
const int AvgPower = 2; // 1 for 2 readings, 2 for 4 readings, 3 for 8, etc.
const int AvgCount = 1 << AvgPower; // power of two, kept as an int without calling pow()
QueueList <int> SensorReadings;
void setup(){
Serial.begin(9600);
}
void loop()
{
int reading = analogRead(ANALOG_SHARP);
SensorReadings.push(reading);
if(SensorReadings.count() > AvgCount)
{
int avg = average2(SensorReadings, AvgPower);
Serial.println(gp2d12_to_cm(avg));
}
}
float gp2d12_to_cm(int reading)
{
if(reading <= 3){ return -1; }
return((6787.0 /((float)reading - 3.0)) - 4.0);
}
int average2(QueueList<int> & q, int AvgPower)
{
int AvgCount = 1 << AvgPower;  // power-of-two count, so the average below is a right shift
long sum = 0;
for(int i=0; i<AvgCount; i++)
{
sum += q.pop();
}
return (sum >> AvgPower);
}
You are using queuea.peek() to obtain the count, but peek() returns an element of the queue, not the number of elements. You should use queuea.count() instead.
Also you might consider changing the condition tmp < 3 to tmp <= 3. If tmp is 3, you divide by zero.
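Concretely, the two corrected lines would look something like this (sketch):
if ( 5 <= queuea.count()) {   // count() returns the number of queued items
and
if (tmp <= 3)
    return -1; // Invalid value (tmp == 3 would divide by zero).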
Great improvement, jedwards; however, the first question I have is why use a QueueList instead of an int array.
As an example I would do the following:
int average(int analog_reading)
{
#define NUM_OF_AVG 5
static int readings[NUM_OF_AVG];
static int next_position;
if (++next_position >= NUM_OF_AVG)
{
next_position=0;
}
readings[next_position]=analog_reading;
int sum = 0;                  // recompute the sum of the buffer on every call
for(int i=0; i<NUM_OF_AVG; i++)
{
sum += readings[i];
}
return sum/NUM_OF_AVG;
}
Now I compute a new rolling average with every reading, and it eliminates all the issues related to dynamic memory allocation (memory fragmentation, running out of memory, memory leaks) on an embedded device.
I appreciate and understand the use of shifting for a division by 2, 4 or 8; however, I would stay away from that technique for two reasons.
I think readability and maintainability of the source code is more important than saving a little bit of time with a shift instead of a divide, unless you can test and verify that the divide is a bottleneck.
Second, I believe most current optimizing compilers will emit a shift when possible; I know GCC does.
I will leave refactoring out the for loop for the next guy.
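For what it's worth, refactoring out the for loop could look like the sketch below: keep a running sum, subtract the reading that is about to be overwritten, and add the new one (names reused from the function above; note the average reads low until the buffer has been filled once).
int average(int analog_reading)
{
#define NUM_OF_AVG 5
    static int readings[NUM_OF_AVG];   // ring buffer of the most recent readings
    static int next_position = 0;
    static long sum = 0;               // running sum of the buffer contents
    sum -= readings[next_position];    // drop the value being replaced
    readings[next_position] = analog_reading;
    sum += analog_reading;
    if (++next_position >= NUM_OF_AVG)
    {
        next_position = 0;
    }
    return sum / NUM_OF_AVG;
}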