Fast percentile in C++

My program calculates a Monte Carlo simulation for the value-at-risk metric. To simplify as much as possible, I have:
1/ simulated daily cashflows
2/ to get a sample of a possible 1-year cashflow, I need to draw 365 random daily cashflows and sum them
Hence, the daily cashflows are an empirically given distribution function to be sampled 365 times. For this, I
1/ sort the daily cashflows into an array called *this->distro*
2/ calculate 365 percentiles corresponding to random probabilities
I need to do this simulation of a yearly cashflow, say, 10K times to get a population of simulated yearly cashflows to work with. Having the distribution function of daily cashflows prepared, I do the sampling like...
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        prob = (FLT_TYPE)fastrand();   // prob [0,1]
        dIdx = prob * dMaxDistroIndex; // scale prob to distro function size
                                       // to get an index into distro array
        _floor = ((FLT_TYPE)(long)dIdx); // fast version of floor
        _ceil  = _floor + 1.0f;          // 'fast' ceil:)
        iIdx1  = (unsigned int)( _floor );
        iIdx2  = iIdx1 + 1;
        // interpolation per se
        generatedVal += this->distro[iIdx1]*(_ceil - dIdx);
        generatedVal += this->distro[iIdx2]*(dIdx - _floor);
    }
    this->yearlyCashflows[idxSim] = generatedVal;
}
The code inside both for loops does linear interpolation. If, say, USD 1000 corresponds to prob=0.01 and USD 10000 corresponds to prob=0.1, then if I don't have an empirical number for p=0.05 I want to get USD 5000 by interpolation.
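Written out, that is just standard linear interpolation between the bracketing empirical points (p1, v1) and (p2, v2): v(p) = v1 + (v2 - v1) * (p - p1) / (p2 - p1), so v(0.05) = 1000 + 9000 * (0.04 / 0.09) = 5000.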
The question: this code runs correctly, though the profiler says that the program spends circa 60% of its runtime on the interpolation per se. So my question is: how can I make this task faster? Sample runtimes reported by VTune for specific lines are as follows:
prob = (FLT_TYPE)fastrand(); // 0.727s
dIdx = prob * dMaxDistroIndex; // 1.435s
_floor = ((FLT_TYPE)(long)dIdx); // 0.718s
_ceil = _floor + 1.0f; // -
iIdx1 = (unsigned int)( _floor ); // 4.949s
iIdx2 = iIdx1 + 1; // -
// interpolation per se
generatedVal += this->distro[iIdx1]*(_ceil - dIdx ); // -
generatedVal += this->distro[iIdx2]*(dIdx - _floor); // 12.704s
Dashes mean the profiler reports no runtimes for those lines.
Any hint will be greatly appreciated.
Daniel
EDIT:
Both c.fogelklou and MSalters have pointed out great enhancements. The best code in line with what c.fogelklou said is:
converter = distroDimension / ((FLT_TYPE)RAND_MAX + 1.0f);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        dIdx   = (FLT_TYPE)fastrand() * converter;
        iIdx1  = (unsigned long)dIdx;
        _floor = (FLT_TYPE)iIdx1;
        generatedVal += this->distro[iIdx1] + this->diffs[iIdx1] * (dIdx - _floor);
    }
}
and the best I have along MSalters' lines is:
normalizer = 1.0 / ((FLT_TYPE)RAND_MAX + 1.0);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        dIdx  = (FLT_TYPE)fastrand() * normalizer;
        iIdx1 = fastrand() % _g.xDayCount;
        generatedVal += this->distro[iIdx1];
        generatedVal += this->diffs[iIdx1] * dIdx;
    }
}
The second code is approx. 30 percent faster. Now, of 95 s of total runtime, the last line consumes 68 s. The second-to-last line consumes only 3.2 s, hence the double*double multiplication must be the devil. I thought of SSE - saving the last three operands into an array and then carrying out a vector multiplication of this->diffs[i]*dIdx[i] and adding this to this->distro[i] - but this code ran 50 percent slower. Hence, I think I hit the wall.
Many thanks to all.
D.

This is a proposal for a small optimization, removing the need for ceil, two casts, and one of the multiplies. If you are running on a fixed point processor, that would explain why the multiplies and casts between float and int are taking so long. In that case, try using fixed point optimizations or turning on floating point in your compiler if the CPU supports it!
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        prob = (FLT_TYPE)fastrand();   // prob [0,1]
        dIdx = prob * dMaxDistroIndex; // scale prob to distro function size
                                       // to get an index into distro array
        iIdx1  = (long)dIdx;
        _floor = (FLT_TYPE)iIdx1;      // fast version of floor
        iIdx2  = iIdx1 + 1;
        // interpolation per se
        {
            const FLT_TYPE diff   = this->distro[iIdx2] - this->distro[iIdx1];
            const FLT_TYPE interp = this->distro[iIdx1] + diff * (dIdx - _floor);
            generatedVal += interp;
        }
    }
    this->yearlyCashflows[idxSim] = generatedVal;
}
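A natural companion step (my sketch, not part of the original answer) is to hoist that difference out of the loops entirely, which is what the this->diffs array in the OP's edit amounts to; it assumes diffs has room for dMaxDistroIndex entries:
// Precompute the slopes once, before the simulation loops, so the
// inner loop needs only one multiply and one add per sample.
for ( unsigned int i = 0; i < dMaxDistroIndex; i++ )
    this->diffs[i] = this->distro[i+1] - this->distro[i];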

I would recommend fixing fastrand. Floating-point code isn't the fastest in the world, but what is especially slow is switching between floating-point and integer code. Since you need an integer index, use an integer random function.
It may even be advantageous to pre-generate all 365 random values in a loop. Since you need only log2(dMaxDistroIndex) bits of randomness per value, you may be able to reduce the number of RNG calls.
You would subsequently pick a random number between 0 and 1 for the interpolation fraction.
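A minimal sketch of that combination, assuming (as the OP's edit suggests) that fastrand() returns an integer in [0, RAND_MAX] and that a diffs array of precomputed slopes exists:
iIdx1 = fastrand() % _g.xDayCount;         // integer bin pick, no float round-trip
dIdx  = (FLT_TYPE)fastrand() * normalizer; // interpolation fraction in [0,1)
generatedVal += this->distro[iIdx1] + this->diffs[iIdx1] * dIdx;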

Related

Difference between logspace generators

Looking through ncmpcpp's spectrum visualizer code, I found a method that generates a "logspace," a vector used to group frequencies into log-scaled bins after applying an FFT.
Here is the (isolated) code:
// Lowest frequency in display
const double HZ_MIN = 20;
// Highest frequency in display
const double HZ_MAX = 20000;
// Number of bars in spectrum
const size_t width = 100;
std::vector<double> dft_logspace;
void GenLogspace() {
    // Calculate number of extra bins needed between 0 HZ and HZ_MIN
    const size_t left_bins = (log10(HZ_MIN) - width*log10(HZ_MIN)) / (log10(HZ_MIN) - log10(HZ_MAX));
    // Generate logspaced frequencies
    dft_logspace.resize(width);
    const double log_scale = log10(HZ_MAX) / (left_bins + dft_logspace.size() - 1);
    for (size_t i = left_bins; i < dft_logspace.size() + left_bins; ++i) {
        dft_logspace[i - left_bins] = pow(10, i * log_scale);
    }
}
I spent a while trying to understand how this works... and it seems to be an awfully complicated way to get the same result as the following function, which works the way you'd expect:
Given limits a and b so that a < b, divide the interval [log10(a), log10(b)] into equal subintervals and exponential-map your way back.
// a = HZ_MIN, and
// b = HZ_MAX
void my_GenLogspace() {
    dft_logspace.resize(width);
    // Generate log-scaled frequency bins between HZ_MAX and HZ_MIN
    for (size_t i = 0; i < width; i++) {
        dft_logspace[i] = HZ_MIN * pow((HZ_MAX/HZ_MIN), ((double) i/(width-1)));
    }
}
I'm fairly sure that these are mathematically identical.
Are they? Is there any reason to use original method over my rewrite? Does the author of the commit that introduced this code know something I don't?
Edit: (width-1), per Bob__'s suggestion
Got it. If anyone happens to need this later...
// Generate log-scaled vector of frequencies from HZ_MIN to HZ_MAX
void GenLogspace() {
    // Prepare vector
    dft_logspace.resize(width);
    // Calculate number of extra bins needed between 0 HZ and HZ_MIN
    // In logspace, divide the region between MAX and MIN into
    // w - 1 equal segments (by fencepost, this gives us w separators)
    const double d = (log10(HZ_MAX) - log10(HZ_MIN)) / (width - 1);
    // Count how many of these segments will fit between
    // 0 and MIN (note that we're still in logspace).
    // This is how many log-scaled intervals are outside
    // our desired range of frequencies.
    const size_t skip_bins = log10(HZ_MIN) / d;
    // Calculate log scale size.
    // We can't use the value of d here, because d is "anchored" to both MIN and MAX.
    // The last bin should be equal to MAX, but there may not be a bin that is equal to MIN.
    //
    // So, we re-partition our logspace:
    // divide the distance between 0 and MAX into equal partitions.
    const double log_scale = log10(HZ_MAX) / (skip_bins + width - 1);
    // Exponential-map bins out of logspace, skipping those that are outside our range.
    // Note that the first (skipped) bin is ALWAYS 1, since 10^0 = 1.
    // The last bin ALWAYS equals MAX.
    for (size_t i = skip_bins; i < width + skip_bins; ++i) {
        dft_logspace[i - skip_bins] = pow(10, i * log_scale);
    }
}
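If anyone wants to check how close the two generators actually come to each other, a quick numeric comparison along these lines should settle it (a sketch; it assumes the globals and both functions above are in scope):
#include <cmath>
#include <cstdio>
int main() {
    GenLogspace();
    std::vector<double> original = dft_logspace; // keep a copy
    my_GenLogspace();
    // Report any bin where the two generators disagree beyond rounding.
    for (size_t i = 0; i < width; ++i)
        if (std::fabs(original[i] - dft_logspace[i]) > 1e-9 * dft_logspace[i])
            std::printf("bin %zu: %.6f vs %.6f\n", i, original[i], dft_logspace[i]);
    return 0;
}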

OpenMP parallel for

I have the following method, pgain, which calls the method dist that I am trying to parallelize:
/******************************************************************************/
/* For a given point x, find the cost of the following operation:
 * -- open a facility at x if there isn't already one there,
 * -- for points y such that the assignment distance of y exceeds dist(y, x),
 *    make y a member of x,
 * -- for facilities y such that reassigning y and all its members to x
 *    would save cost, realize this closing and reassignment.
 *
 * If the cost of this operation is negative (i.e., if this entire operation
 * saves cost), perform this operation and return the amount of cost saved;
 * otherwise, do nothing.
 */
/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
    int i;
    int number_of_centers_to_close = 0;
    static double *work_mem;
    static double gl_cost_of_opening_x;
    static int gl_number_of_centers_to_close;
    int stride = *numcenters + 2;
    // make stride a multiple of CACHE_LINE
    int cl = CACHE_LINE / sizeof ( double );
    if ( stride % cl != 0 ) {
        stride = cl * ( stride / cl + 1 );
    }
    int K = stride - 2; // K == *numcenters
    // my own cost of opening x
    double cost_of_opening_x = 0;
    work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
    gl_cost_of_opening_x = 0;
    gl_number_of_centers_to_close = 0;
    /*
     * For each center, we have a *lower* field that indicates
     * how much we will save by closing the center.
     */
    int count = 0;
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            center_table[i] = count++;
        }
    }
    work_mem[0] = 0;
    // now we finish building the table. clear the working memory.
    memset ( switch_membership, 0, points->num * sizeof ( bool ) );
    memset ( work_mem, 0, stride * sizeof ( double ) );
    memset ( work_mem + stride, 0, stride * sizeof ( double ) );
    // my *lower* fields
    double* lower = &work_mem[0];
    // global *lower* fields
    double* gl_lower = &work_mem[stride];
    #pragma omp parallel for
    for ( i = 0; i < points->num; i++ ) {
        float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
        float current_cost = points->p[i].cost;
        if ( x_cost < current_cost ) {
            // point i would save cost just by switching to x
            // (note that i cannot be a median,
            // or else dist(p[i], p[x]) would be 0)
            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;
        } else {
            // cost of assigning i to x is at least current assignment cost of i
            // consider the savings that i's **current** median would realize
            // if we reassigned that median and all its members to x;
            // note we've already accounted for the fact that the median
            // would save z by closing; now we have to subtract from the savings
            // the extra cost of reassigning that median and its members
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
    }
    // at this time, we can calculate the cost of opening a center
    // at x; if it is negative, we'll go through with opening it
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            double low = z + work_mem[center_table[i]];
            gl_lower[center_table[i]] = low;
            if ( low > 0 ) {
                // i is a median, and
                // if we were to open x (which we still may not) we'd close i
                // note, we'll ignore the following quantity unless we do open x
                ++number_of_centers_to_close;
                cost_of_opening_x -= low;
            }
        }
    }
    // use the rest of working memory to store the following
    work_mem[K] = number_of_centers_to_close;
    work_mem[K+1] = cost_of_opening_x;
    gl_number_of_centers_to_close = ( int ) work_mem[K];
    gl_cost_of_opening_x = z + work_mem[K+1];
    // Now, check whether opening x would save cost; if so, do it, and
    // otherwise do nothing
    if ( gl_cost_of_opening_x < 0 ) {
        // we'd save money by opening x; we'll do it
        for ( int i = 0; i < points->num; i++ ) {
            bool close_center = gl_lower[center_table[points->p[i].assign]] > 0;
            if ( switch_membership[i] || close_center ) {
                // Either i's median (which may be i itself) is closing,
                // or i is closer to x than to its current median
                points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
                points->p[i].assign = x;
            }
        }
        for ( int i = 0; i < points->num; i++ ) {
            if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
                is_center[i] = false;
            }
        }
        if ( x >= 0 && x < points->num ) {
            is_center[x] = true;
        }
        *numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
    } else {
        gl_cost_of_opening_x = 0; // the value we'll return
    }
    free ( work_mem );
    return -gl_cost_of_opening_x;
}
The function that I am trying to parallelize:
/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
    float result = 0.0;
    #pragma omp parallel for reduction(+:result)
    for ( int i = 0; i < dim; i++ ) {
        result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
    }
    return ( result );
}
With Point being this:
/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
    float weight;
    float *coord;
    long assign; /* number of point where this one is assigned */
    float cost;  /* cost of that assignment, weight * distance */
} Point;
I have a large streamcluster application (815 lines of code) that produces real-time numbers and sorts them in a specific way. I have used the Scalasca tool on Linux to measure which methods take up most of the time, and I have found that the method dist listed above is the most time-consuming. I am trying to use OpenMP, but the parallelized code runs longer than the serial code: if the serial code runs in 1.5 s, the parallelized version takes 20 s, although the results are the same. I am wondering whether I can't parallelize this part of the code for some reason, or whether I am not doing it correctly.
The method I am trying to parallelize sits in this call tree: main -> pkmedian -> pFL -> pgain -> dist (-> means that the left method calls the following one).
The code you've chosen to parallelize:
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
    result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
is a poor candidate to benefit from parallelization. You should not use parallel for here. You should probably not use parallelization on an inner loop at all. If you can parallelize some outer loop, you're much more likely to see gains, as in the sketch below.
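Applied to the code in the question, that would mean parallelizing the loop over points in pgain rather than the loop over coordinates in dist. A rough sketch, not a drop-in fix: the scalar accumulator becomes a reduction, and the shared lower[] updates need synchronization (per-thread copies of lower would scale better than the atomic shown here):
#pragma omp parallel for reduction(+:cost_of_opening_x)
for ( int i = 0; i < points->num; i++ ) {
    float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
    float current_cost = points->p[i].cost;
    if ( x_cost < current_cost ) {
        switch_membership[i] = 1;
        cost_of_opening_x += x_cost - current_cost;
    } else {
        int assign = points->p[i].assign;
        #pragma omp atomic  // serialize updates to the shared lower[] array
        lower[center_table[assign]] += current_cost - x_cost;
    }
}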
There is an overhead to coordinate the thread team to start the parallel region and another overhead for performing the reduction afterwards. Meanwhile, the parallel region's contents take essentially no time to run. Given that, you'd need dim to be extremely large before you'd expect this to give a performance benefit.
To express that point more graphically, consider that the math you're doing will take nanoseconds and compare it against this chart showing the overhead of various OpenMP directives.
If you need this to run faster, your first stop should be to use appropriate compilation flags, followed by looking into SIMD operations: SSE and AVX are good keywords. Your compiler might even invoke them automatically.
I've built some test code (see below), compiled it with various optimizations enabled, as listed below, and run it on arrays of 100,000 elements. Note that enabling -O3 brings the run-time down to the same order as the overhead of the OpenMP directives themselves. This implies that you'd want arrays of about 400,000 elements before you'd want to think about using OpenMP, and probably more like 1,000,000 to be safe.
No optimizations. Run-time is ~1900μs.
-O3: Enables many optimizations. Run-time is ~200μs.
-ffast-math: You want this, unless you're doing some very tricky things. Run-time is about the same.
-march=native: Compile code to use the full capabilities of your CPU, rather than a generic instruction set that would work on many CPUs. Run-time is ~100μs.
So there we go, strategic use of compiler options (-march=native) can double the speed of the code in question without having to muck about in parallelism.
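For concreteness, those flags combine into an invocation along these lines (assuming g++ and the test code below saved as test.cpp):
g++ -O3 -ffast-math -march=native test.cpp -o test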
Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.
Test code:
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
int main(){
    std::vector<double> a;
    std::vector<double> b;
    for(int i=0;i<100000;i++){
        a.push_back(rand()/(double)RAND_MAX);
        b.push_back(rand()/(double)RAND_MAX);
    }
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    float result = 0.0;
    //#pragma omp parallel for reduction(+:result)
    for (unsigned int i=0; i<a.size(); i++ )
        result += ( a[i] - b[i] ) * ( a[i] - b[i] );
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
              << " microseconds" << std::endl;
}

FFT Spectrum not displaying correctly

I'm currently trying to display an audio spectrum using FFTW3 and SFML. I've followed the directions found here and looked at numerous references on FFTs, spectrums, and FFTW, yet somehow my bars are almost all aligned to the left, unlike the image below mine. Another issue is that I can't find information on what the scale of the FFT output is; currently I'm dividing it by 64, yet it still reaches beyond that occasionally. And further still, I have found no information on why the output from FFTW has to be the same size as the input. So my questions are:
Why is the majority of my spectrum aligned to the left, unlike the image below mine?
Why isn't the output between 0.0 and 1.0?
Why is the input sample count related to the FFT output count?
What I get:
What I'm looking for:
const int bufferSize = 256 * 8;
void init() {
    sampleCount = (int)buffer.getSampleCount();
    channelCount = (int)buffer.getChannelCount();
    for (int i = 0; i < bufferSize; i++) {
        window.push_back(0.54f - 0.46f * cos(2.0f * GMath::PI * (float)i / (float)bufferSize));
    }
    plan = fftwf_plan_dft_1d(bufferSize, signal, results, FFTW_FORWARD, FFTW_ESTIMATE);
}
void update() {
    int mark = (int)(sound.getPlayingOffset().asSeconds() * sampleRate);
    for (int i = 0; i < bufferSize; i++) {
        float s = 0.0f;
        if (i + mark < sampleCount) {
            s = (float)buffer.getSamples()[(i + mark) * channelCount] / (float)SHRT_MAX * window[i];
        }
        signal[i][0] = s;
        signal[i][1] = 0.0f;
    }
}
void draw() {
    int inc = bufferSize / 2 / size.x;
    int y = size.y - 1;
    int max = size.y;
    for (int i = 0; i < size.x; i++) {
        float total = 0.0f;
        for (int j = 0; j < inc; j++) {
            int index = i * inc + j;
            total += std::sqrt(results[index][0] * results[index][0] + results[index][1] * results[index][1]);
        }
        total /= (float)(inc * 64);
        Rectangle2I rect = Rectangle2I(i, y, 1, -(int)(total * max)).absRect();
        g->setPixel(rect, Pixel(254, toColor(BLACK, GREEN)));
    }
}
All of your questions are related to FFT theory. Study the properties of the FFT in any standard text or reference book and you will be able to answer them yourself.
The least you can start from is here:
https://en.wikipedia.org/wiki/Fast_Fourier_transform.
Many FFT implementations are energy preserving. That means the scale of the output is linearly related to the scale and/or size of the input.
An FFT is a DFT, and a DFT is a square matrix transform, so the number of outputs will always equal the number of inputs (or half that if you ignore the redundant complex-conjugate half, given strictly real input), unless some outputs are thrown away - otherwise it isn't an FFT. If you want fewer outputs, there are ways to downsample the FFT output or post-process it in other ways.
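To make the scaling question concrete: FFTW in particular computes an unnormalized transform, so the raw output grows with both the input amplitude and the transform length. A sketch of a normalization step, reusing the names from the question (the Hamming window's gain shrinks the result further, so treat the exact divisor as an assumption to verify):
// Magnitude of bin `index`, normalized so a full-scale sine lands near 1.0.
// Dividing by bufferSize/2 accounts for the unnormalized FFT and for a real
// sine's energy being split between a bin and its conjugate mirror.
float magnitude(int index) {
    float re = results[index][0];
    float im = results[index][1];
    return std::sqrt(re * re + im * im) / ((float)bufferSize / 2.0f);
}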

For an Arduino sketch based light meter, functions outside of 'loop' are not firing

I'm very new to Arduino; I have much more experience with Java and ActionScript 3. I'm working on building a light meter out of an Arduino Uno and a TAOS TSL235R light-to-frequency converter.
I could only find a tutorial using a different sensor, so I am working my way through converting what I need to get it all to work (AKA some copy and paste, shamefully, but I'm new to this).
There are three parts: this is the first tutorial of the series Arduino and the Taos TSL230R Light Sensor: Getting Started.
The photographic conversion: Arduino and the TSL230R: Photographic Conversions.
At first, I could return values for the frequency created by the TSL235R sensor, but once I added the code for the photographic conversions I only get zero returned, and none of the functions outside of the main loop seem to fire, given that my Serial.println() calls print nothing.
I am more concerned with making the functions fire than with whether my math is perfect. In ActionScript and Java there are event listeners for functions and such; do I need to declare a function for it to fire in C/C++?
Basically, how can I make sure all my functions fire in the C programming language?
My Arduino Sketch:
// TSL230R Pin Definitions
#define TSL_FREQ_PIN 2
// Our pulse counter for our interrupt
unsigned long pulse_cnt = 0;
// How often to calculate frequency
// 1000 ms = 1 second
#define READ_TM 1000
// Two variables used to track time
unsigned long cur_tm = millis();
unsigned long pre_tm = cur_tm;
// We'll need to access the amount of time passed
unsigned int tm_diff = 0;
unsigned long frequency;
unsigned long freq;
float lux;
float Bv;
float Sv;
// Set our frequency multiplier to a default of 1,
// which maps to output frequency scaling of 100x.
int freq_mult = 100;
// We need to measure what to divide the frequency by:
// 1x sensitivity = 10,
// 10x sensitivity = 100,
// 100x sensitivity = 1000
int calc_sensitivity = 10;

void setup() {
    attachInterrupt(0, add_pulse, RISING); // Attach interrupt to pin 2.
    pinMode(TSL_FREQ_PIN, INPUT);          // Set the sensor pin as input.
    Serial.begin(9600);                    // Start the serial connection with the computer.
} //setup

void loop() {
    // Check the value of the light sensor every READ_TM ms and
    // calculate how much time has passed.
    pre_tm = cur_tm;
    cur_tm = millis();
    if ( cur_tm > pre_tm ) {
        tm_diff += cur_tm - pre_tm;
    }
    else if ( cur_tm < pre_tm ) {
        // Handle overflow and rollover (Arduino 011)
        tm_diff += ( cur_tm + ( 34359737 - pre_tm ) );
    }
    // If enough time has passed to do a new reading...
    if ( tm_diff >= READ_TM ) {
        // Reset the ms counter
        tm_diff = 0;
        // Get our current frequency reading
        frequency = get_tsl_freq();
        // Calculate radiant energy
        float uw_cm2 = calc_uwatt_cm2( frequency );
        // Calculate illuminance
        float lux = calc_lux_single( uw_cm2, 0.175 );
    }
    Serial.println(freq);
    delay(1000);
} //loop

unsigned long get_tsl_freq() {
    // We have to scale out the frequency --
    // scaling on the TSL230R requires us to multiply by a factor
    // to get actual frequency.
    unsigned long freq = pulse_cnt * 100;
    // Reset pulse counter
    pulse_cnt = 0;
    return(freq);
    Serial.println("freq");
} //get_tsl_freq

void add_pulse() {
    // Increase pulse count
    pulse_cnt++;
    return;
    Serial.println("Pulse");
} //add_pulse

float calc_lux_single(float uw_cm2, float efficiency) {
    // Calculate lux (lm/m^2), using standard formula
    // Xv = Xl * V(l) * Km
    // where Xl is W/m^2 (calculate actual received uW/cm^2, extrapolate from sensor size
    // to whole cm size, then convert uW to W),
    // V(l) = efficiency function (provided via argument) and
    // Km = constant, lm/W at 555 nm = 683 (555 nm has efficiency function of nearly 1.0).
    //
    // Only a single wavelength is calculated - you'd better make sure that your
    // source is of a single wavelength... Otherwise, you should be using
    // calc_lux_gauss() for multiple wavelengths.
    // Convert to W/m^2
    float w_m2 = (uw_cm2 / (float) 1000000) * (float) 100;
    // Calculate lux
    float lux = w_m2 * efficiency * (float) 683;
    return(lux);
    Serial.println("Get lux");
} //calc_lux_single

float calc_uwatt_cm2(unsigned long freq) {
    // Get uW observed - assume 640 nm wavelength.
    // Note the divide-by factor of ten -
    // maps to a sensitivity of 1x.
    float uw_cm2 = (float) freq / (float) 10;
    // Extrapolate into the entire cm2 area
    uw_cm2 *= ( (float) 1 / (float) 0.0136 );
    return(uw_cm2);
    Serial.println("Get uw_cm2");
} //calc_uwatt_cm2

float calc_ev( float lux, int iso ) {
    // Calculate EV using the APEX method:
    //
    // Ev = Av + Tv = Bv + Sv
    //
    // We'll use the right-hand side for this operation:
    //
    // Bv = log2( B/NK )
    // Sv = log2( NSx )
    float Sv = log( (float) 0.3 * (float) iso ) / log(2);
    float Bv = log( lux / ( (float) 0.3 * (float) 14 ) ) / log(2);
    return( Bv + Sv );
    Serial.println("get Bv+Sv");
} //calc_ev

float calc_exp_tm ( float ev, float aperture ) {
    // Ev = Av + Tv = Bv + Sv
    // We need to determine the Tv value, so Ev - Av = Tv
    // Av = log2(Aperture^2)
    // Tv = log2( 1/T ) = log2(T) = 2^(Ev - Av)
    float exp_tm = ev - ( log( pow(aperture, 2) ) / log(2) );
    float exp_log = pow(2, exp_tm);
    return( exp_log );
    Serial.println("get exp_log");
} //calc_exp_tm

unsigned int calc_exp_ms( float exp_tm ) {
    unsigned int cur_exp_tm = 0;
    // Calculate ms of exposure, given a divisor exposure time.
    if ( exp_tm >= 2 ) {
        // Deal with times less than or equal to half a second
        if ( exp_tm >= (float) int(exp_tm) + (float) 0.5 ) {
            // Round up
            exp_tm = int(exp_tm) + 1;
        }
        else {
            // Round down
            exp_tm = int(exp_tm);
        }
        cur_exp_tm = 1000 / exp_tm;
    }
    else if ( exp_tm >= 1 ) {
        // Deal with times larger than 1/2 second
        float disp_v = 1 / exp_tm;
        // Get first significant digit
        disp_v = int( disp_v * 10 );
        cur_exp_tm = ( 1000 * disp_v ) / 10;
    }
    else {
        // Times larger than 1 second
        int disp_v = int( (float) 1 / exp_tm );
        cur_exp_tm = 1000 * disp_v;
    }
    return(cur_exp_tm);
    Serial.println("get cur_exp_tm");
} //calc_exp_ms

float calc_exp_aperture( float ev, float exp_tm ) {
    float exp_apt = ev - ( log( (float) 1 / exp_tm ) / log(2) );
    float apt_log = pow(2, exp_apt);
    return( apt_log );
    Serial.println("get apt_log");
} //calc_exp_aperture
That is a lot of code to read; where should I start?
In your loop() you are assigning frequency but printing freq:
// get our current frequency reading
frequency = get_tsl_freq();
-- snip --
Serial.println(freq);
In get_tsl_freq() you are creating a local unsigned long freq that hides the global freq, and using that for the calculation and return value; maybe that is also a source of confusion for you. I see no reason for frequency and freq to be globals in this code. The function also contains unreachable code: control leaves the function at the return, so statements after the return will never execute.
unsigned long get_tsl_freq() {
    unsigned long freq = pulse_cnt * 100; // <-- hides the global variable freq
    // Reset pulse counter
    pulse_cnt = 0;
    return(freq);           // <-- ( ) not needed
    Serial.println("freq"); // <-- unreachable
}
Reading a bit more, I can suggest you pick up a C++ book. While your code compiles, it is not technically valid C++; you get away with it thanks to the Arduino software, which does some mangling to allow using functions before they are declared.
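In standard C++ you would declare each function before its first use, for example with prototypes near the top of the file (the Arduino preprocessor normally generates these for you, which is why the sketch compiles as-is):
unsigned long get_tsl_freq();
void add_pulse();
float calc_uwatt_cm2(unsigned long freq);
float calc_lux_single(float uw_cm2, float efficiency);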
On the constants you use in your calculations:
float w_m2 = (uw_cm2 / (float) 1000000) * (float) 100;
could be written as
float w_m2 = (uw_cm2 / 1000000.0f) * 100.0f;
or even like this because uw_cm2 is a float
float w_m2 = (uw_cm2 / 1000000) * 100;
You also seem to take both approaches to waiting: you have code that calculates and only runs if it has been more than 1000 ms since it last ran, but then you also delay(1000) in the same code. This may not work as expected at all.
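Putting those points together, a corrected loop() might look like this (a sketch: it prints the variable that is actually assigned and drops the delay(1000), letting the tm_diff gate do the pacing; the rollover branch is omitted here for brevity):
void loop() {
    pre_tm = cur_tm;
    cur_tm = millis();
    if ( cur_tm > pre_tm ) {
        tm_diff += cur_tm - pre_tm;
    }
    if ( tm_diff >= READ_TM ) {
        tm_diff = 0;
        frequency = get_tsl_freq();
        float uw_cm2 = calc_uwatt_cm2( frequency );
        float lux = calc_lux_single( uw_cm2, 0.175 );
        Serial.println( frequency ); // was: Serial.println(freq)
    }
}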

What is faster for division: doubles / floats / UInt32 / UInt64? (C++/C)

I did some speed testing to figure out what is the fastest when doing multiplication or division on numbers. I had to work really hard to defeat the optimiser. I got nonsensical results, such as a massive loop operating in 2 microseconds, or multiplication being the same speed as division (if only that were true).
After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else:
If my test is STILL FLAWED, let me know, but be kind, seeing as I just spent two hours writing this crap :P
64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us
"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either).
Here's the code I used to get the results:
#include <cstdio>   // printf
#include <cstdint>  // uint32_t, uint64_t
#include <ctime>    // clock

int Run(int bla, int div, int add, int minus) {
    // These parameters are to force the compiler to not be able to optimise
    // away the multiplications and divides :)
    long LoopMax = 100000000;
    uint32_t Origbla32 = 1000000000;
    long i = 0;

    uint32_t bla32 = Origbla32;
    uint32_t div32 = div;
    clock_t Time32 = clock();
    for (i = 0; i < LoopMax; i++) {
        div32 += add;
        div32 -= minus;
        bla32 = bla32 / div32;
        bla32 += bla;
        bla32 = bla32 * div32;
    }
    Time32 = clock() - Time32;

    uint64_t bla64 = bla32;
    clock_t Time64 = clock();
    uint64_t div64 = div;
    for (long i = 0; i < LoopMax; i++) {
        div64 += add;
        div64 -= minus;
        bla64 = bla64 / div64;
        bla64 += bla;
        bla64 = bla64 * div64;
    }
    Time64 = clock() - Time64;

    double blaDMul = Origbla32;
    double multodiv = 1.0 / (double)div;
    double multomul = div;
    clock_t TimeDMul = clock();
    for (i = 0; i < LoopMax; i++) {
        multodiv += add;
        multomul -= minus;
        blaDMul = blaDMul * multodiv;
        blaDMul += bla;
        blaDMul = blaDMul * multomul;
    }
    TimeDMul = clock() - TimeDMul;

    double blaDDiv = Origbla32;
    clock_t TimeDDiv = clock();
    for (i = 0; i < LoopMax; i++) {
        multodiv += add;
        multomul -= minus;
        blaDDiv = blaDDiv / multomul;
        blaDDiv += bla;
        blaDDiv = blaDDiv / multodiv;
    }
    TimeDDiv = clock() - TimeDDiv;

    float blaS = Origbla32;
    float divS = div;
    clock_t TimeS = clock();
    for (i = 0; i < LoopMax; i++) {
        divS += add;
        divS -= minus;
        blaS = blaS / divS;
        blaS += bla;
        blaS = blaS * divS;
    }
    TimeS = clock() - TimeS;

    printf("64 time: %i us (%i)\n", (int)Time64, (int)bla64);
    printf("32 time: %i us (%i)\n", (int)Time32, bla32);
    printf("D(mul) time: %i us (%f)\n", (int)TimeDMul, blaDMul);
    printf("D(div) time: %i us (%f)\n", (int)TimeDDiv, blaDDiv);
    printf("S time: %i us (%f)\n", (int)TimeS, blaS);
    return 0;
}

int main(int argc, char* const argv[]) {
    Run(0, 10, 0, 0); // adds and subtracts 0 so it doesn't affect the math, only kills the opts
    return 0;
}
There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
Let the compiler do its thing with the program and data flow information it has.
For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"
What is fastest will depend entirely on the target architecture. It looks like you're interested only in the platform you happen to be on, which, guessing from your execution times, seems to be 64-bit x86, either Intel (Core2?) or AMD.
That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.
Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".
Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.
As Stephen mentioned, use the optimisation manual - but you should also be considering the use of SSE instructions. These can do 4 or 8 divisions / multiplications in a single instruction.
Also, it is fairly common for a division to take a single clock cycle to process. The result may not be available for several clock cycles (called latency), however the next division can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipe-lining in the CPU, in the same way as you can wash more clothes while the previous load is still drying.
Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.
There is a very good chance that you will spend time and effort making the maths fast, only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.
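For reference, the reciprocal trick in its simplest form (a sketch; it only pays off when one divisor is reused across many dividends, and it costs an extra rounding step versus a true divide):
// Scale n values by 1/divisor: one real divide, hoisted out of the loop,
// then a multiply per element instead of a divide per element.
void scale_by(double* data, unsigned n, double divisor) {
    const double inv = 1.0 / divisor;
    for (unsigned i = 0; i < n; ++i)
        data[i] *= inv;
}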
I wrote a flawed test to do this on MSVC 2008
double i32Time = GetTime();
{
    volatile __int32 i = 4;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i /= 61;
        count++;
    }
}
i32Time = GetTime() - i32Time;

double i64Time = GetTime();
{
    volatile __int64 i = 4;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i /= 61;
        count++;
    }
}
i64Time = GetTime() - i64Time;

double fTime = GetTime();
{
    volatile float i = 4;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i /= 4.0f;
        count++;
    }
}
fTime = GetTime() - fTime;

double fmTime = GetTime();
{
    volatile float i = 4;
    const float div = 1.0f / 4.0f;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i *= div;
        count++;
    }
}
fmTime = GetTime() - fmTime;

double dTime = GetTime();
{
    volatile double i = 4;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i /= 4.0f;
        count++;
    }
}
dTime = GetTime() - dTime;

double dmTime = GetTime();
{
    volatile double i = 4;
    const double div = 1.0f / 4.0f;
    __int32 count = 0;
    __int32 max = 1000000;
    while( count < max )
    {
        i *= div;
        count++;
    }
}
dmTime = GetTime() - dmTime;

DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );
DebugBreak();
I then ran it on an AMD64 Turion 64 in 32-bit mode. The results I got were as follows:
0.006622
0.054654
0.006283
0.006353
0.006203
0.006161
The reason the test is flawed is the usage of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).
It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same, if not better. If I change the float and double division checks to divide by "i" then the time increases significantly. Though a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.
To understand such micro-optimisations, try reading this pdf.
All in all, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile and fix the problems as and when they actually ARE a problem.
Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.
I would point out that, even if you are measuring non-vectorized floating-point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating-point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make them the default because that could break compatibility with code that relies on the old behavior. You can turn it on in gcc with the -mfpmath=sse flag.
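Usage is just a compiler flag, e.g. (assuming gcc/g++ on 32-bit x86, where the SSE instruction set itself also has to be enabled):
g++ -O2 -msse2 -mfpmath=sse prog.cpp -o prog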