How can I estimate the DC output of a solar plant consisting of multiple modules and inverters in PVLib? - pvlib

I'm using the ModelChain class to estimate DC and AC values for a fictitious solar plant. Input parameters include module, inverter, number of strings, number of modules, number of inverters, albedo, PVGIS TMY data, etc. I apply simple math to calculate number of modules per string and number of strings per inverter then I create one PVSystem object consisting of a single PVArray, per inverter. The I run the ModelChain model for each inverter and, for simplicity, add up the AC output to estimate the total AC for all arrays like this:
for idx in range(0, num_of_inverters):
array = {
'name': f'pvsystem-{idx+1}-array',
'mount': mount,
'module': module_name,
'module_parameters': module_parameters,
'module_type': module_type,
'albedo': albedo,
'strings': strings_per_inverter,
'modules_per_string': modules_per_string,
'temperature_model_parameters': temperature_model_parameters,
}
pvsystem=pvlib.pvsystem.PVSystem(arrays=[pvlib.pvsystem.Array(**array)], inverter_parameters=inverter_parameters)
mc = pvlib.modelchain.ModelChain(pvsystem, location)
mc.run_model(tmy_weather)
total_ac += mc.results.ac.sum()
According to PVLib documentation, the AC output is yearly in Watts hour.
But now I need to get the DC output as well (yearly in Watts hours) so I can calculate the DC/AC ratio. Running mc.results.dc gives me a Dataframe with several values (columns) that are hard to grasp for a newbie like me:
i_sc : Short-circuit current (A)
i_mp : Current at the maximum-power point (A)
v_oc : Open-circuit voltage (V)
v_mp : Voltage at maximum-power point (V)
p_mp : Power at maximum-power point (W)
i_x : Current at module V = 0.5Voc, defines 4th point on I-V curve for modeling curve shape
i_xx : Current at module V = 0.5(Voc+Vmp), defines 5th point on I-V curve for modeling curve shape
I tried using p_mp and adding it up: mc.results.dc['p_mp'].sum() but the output is much bigger than the estimated AC. I usually expect the DC/AC ratio to be somewhere > 1 and <= 1.5, roughly. However, I'm getting DC values that are like 3-5 times bigger which probably means I'm doing something wrong.
Example: 1 string, 1 inverter, 10 modules per string:
Output (yearly):
AC: 869.61kW
DC: 3326.36kW
Ratio: 3.83
Any help is appreciated.

As for why the total DC and AC generation values are so different, it's because the inverter is way undersized for the array. The inverter is rated for 250 W maximum, which is not much more than what a single module produces at STC (calculate by Impo * Vmpo as below, or noticing the "220" in the module name), and you have ten modules total. So the inverter will be saturated at even very low light, and the total AC production will be severely curtailed as a result. I think if you make a plot (mc.results.ac.plot()) you will see that the daily inverter output curve is clipped at 250 W while the simulated DC power can be nearly 10x higher. It's always a good idea to plot your time series when things aren't making sense!
In [23]: pvlib.pvsystem.retrieve_sam('cecinverter')['ABB__MICRO_0_25_I_OUTD_US_208__208V_']['Paco']
Out[23]: 250.0
In [24]: pvlib.pvsystem.retrieve_sam('sandiamod')['Canadian_Solar_CS5P_220M___2009_'][['Impo', 'Vmpo']]
Out[24]:
Impo 4.54629
Vmpo 48.3156
Name: Canadian_Solar_CS5P_220M___2009_, dtype: object
A couple other notes:
Please be careful about units:
Summing (really, integrating) an hourly time series of power (Watts) produces energy (Watt-hours). An annual output in kW doesn't make sense, since kW is for power and power is an instantaneous rate of energy generation. If this is new to you, it might be helpful to think about speed vs distance: a car might be traveling at 60mph at any given time point, but the total distance it travels in a year is measured in miles, not mph. Power is energy per unit time just like speed is distance per unit time.
Summing voltages (num_of_inverters * mc.results.dc['v_mp'].sum()) makes no sense that I can see. Volt-hours doesn't seem like a useful unit to me outside of some very specialized power electronics engineering contexts.
The term "DC/AC ratio" is typically understood to mean the ratio of rated capacities, not annual productions. So for the example in your gist, the DC/AC ratio would be calculated as (220 W/module * 10 modules/string * 2 strings/inverter = 4400 W DC) / (250 W AC) = 17.6 (which is a crazy DC/AC ratio).

Related

Basic example of how to do numerical integration in C++

I think most people know how to do numerical derivation in computer programming, (as limit --> 0; read: "as the limit approaches zero").
//example code for derivation of position over time to obtain velocity
float currPosition, prevPosition, currTime, lastTime, velocity;
while (true)
{
prevPosition = currPosition;
currPosition = getNewPosition();
lastTime = currTime;
currTime = getTimestamp();
// Numerical derivation of position over time to obtain velocity
velocity = (currPosition - prevPosition)/(currTime - lastTime);
}
// since the while loop runs at the shortest period of time, we've already
// achieved limit --> 0;
This is the basic building block for most derivation programming.
How can I do this with integrals? Do I use a for loop and add or what?
Numerical derivation and integration in code for physics, mapping, robotics, gaming, dead-reckoning, and controls
Pay attention to where I use the words "estimate" vs "measurement" below. The difference is important.
Measurements are direct readings from a sensor.
Ex: a GPS measures position (meters) directly, and a speedometer measures speed (m/s) directly.
Estimates are calculated projections you can obtain through integrating and derivating (deriving) measured values.
Ex: you can derive position measurements (m) with respect to time to obtain speed or velocity (m/s) estimates, and you can integrate speed or velocity measurements (m/s) with respect to time to obtain position or displacement (m) estimates.
Wait, aren't all "measurements" actually just "estimates" at some fundamental level?
Yeah--pretty much! But, they are not necessarily produced through derivations or integrations with respect to time, so that is a bit different.
Also note that technically, virtually nothing can truly be measured directly. All sensors get reduced down to a voltage or a current, and guess how you measure a current?--a voltage!--either as a voltage drop across a tiny resistance, or as a voltage induced through an inductive coil due to current flow. So, everything boils down to a voltage. Even devices which "measure speed directly" may be using pressure (pitot-static tube on airplane), doppler/phase shift (radar or sonar), or looking at distance over time and then outputting speed. Fluid speed, or speed with respect to fluid such as air or water, can even be measured via a hot wire anemometer by measuring the current required to keep a hot wire at a fixed temperature, or by measuring the temperature change of the hot wire at a fixed current. And how is that temperature measured? Temperature is just a thermo-electrically-generated voltage, or a voltage drop across a diode or other resistance.
As you can see, all of these "measurements" and "estimates", at the low level, are intertwined. However, if a given device has been produced, tested, and calibrated to output a given "measurement", then you can accept it as a "source of truth" for all practical purposes and call it a "measurement". Then, anything you derive from that measurement, with respect to time or some other variable, you can consider an "estimate". The irony of this is that if you calibrate your device and output derived or integrated estimates, someone else could then consider your output "estimates" as their input "measurements" in their system, in a sort of never-ending chain down the line. That's being pedantic, however. Let's just go with the simplified definitions I have above for the time being.
The following table is true, for example. Read the 2nd line, for instance, as: "If you take the derivative of a velocity measurement with respect to time, you get an acceleration estimate, and if you take its integral, you get a position estimate."
Derivatives and integrals of position
Measurement, y Derivative Integral
Estimate (dy/dt) Estimate (dy*dt)
----------------------- ----------------------- -----------------------
position [m] velocity [m/s] - [m*s]
velocity [m/s] acceleration [m/s^2] position [m]
acceleration [m/s^2] jerk [m/s^3] velocity [m/s]
jerk [m/s^3] snap [m/s^4] acceleration [m/s^2]
snap [m/s^4] crackle [m/s^5] jerk [m/s^3]
crackle [m/s^5] pop [m/s^6] snap [m/s^4]
pop [m/s^6] - [m/s^7] crackle [m/s^5]
For jerk, snap or jounce, crackle, and pop, see: https://en.wikipedia.org/wiki/Fourth,_fifth,_and_sixth_derivatives_of_position.
1. numerical derivation
Remember, derivation obtains the slope of the line, dy/dx, on an x-y plot. The general form is (y_new - y_old)/(x_new - x_old).
In order to obtain a velocity estimate from a system where you are obtaining repeated position measurements (ex: you are taking GPS readings periodically), you must numerically derivate your position measurements over time. Your y-axis is position, and your x-axis is time, so dy/dx is simply (position_new - position_old)/(time_new - time_old). A units check shows this might be meters/sec, which is indeed a unit for velocity.
In code, that would look like this, for a system where you're only measuring position in 1-dimension:
double position_new_m = getPosition(); // m = meters
double position_old_m;
// `getNanoseconds()` should return a `uint64_t timestamp in nanoseconds, for
// instance
double time_new_sec = NS_TO_SEC((double)getNanoseconds());
double time_old_sec;
while (true)
{
position_old_m = position_new_m;
position_new_m = getPosition();
time_old_sec = time_new_sec;
time_new_sec = NS_TO_SEC((double)getNanoseconds());
// Numerical derivation of position measurements over time to obtain
// velocity in meters per second (mps)
double velocity_mps =
(position_new_m - position_old_m)/(time_new_sec - time_old_sec);
}
2. numerical integration
Numerical integration obtains the area under the curve, dy*dx, on an x-y plot. One of the best ways to do this is called trapezoidal integration, where you take the average dy reading and multiply by dx. This would look like this: (y_old + y_new)/2 * (x_new - x_old).
In order to obtain a position estimate from a system where you are obtaining repeated velocity measurements (ex: you are trying to estimate distance traveled while only reading the speedometer on your car), you must numerically integrate your velocity measurements over time. Your y-axis is velocity, and your x-axis is time, so (y_old + y_new)/2 * (x_new - x_old) is simply velocity_old + velocity_new)/2 * (time_new - time_old). A units check shows this might be meters/sec * sec = meters, which is indeed a unit for distance.
In code, that would look like this. Notice that the numerical integration obtains the distance traveled over that one tiny time interval. To obtain an estimate of the total distance traveled, you must sum all of the individual estimates of distance traveled.
double velocity_new_mps = getVelocity(); // mps = meters per second
double velocity_old_mps;
// `getNanoseconds()` should return a `uint64_t timestamp in nanoseconds, for
// instance
double time_new_sec = NS_TO_SEC((double)getNanoseconds());
double time_old_sec;
// Total meters traveled
double distance_traveled_m_total = 0;
while (true)
{
velocity_old_mps = velocity_new_mps;
velocity_new_mps = getVelocity();
time_old_sec = time_new_sec;
time_new_sec = NS_TO_SEC((double)getNanoseconds());
// Numerical integration of velocity measurements over time to obtain
// a distance estimate (in meters) over this time interval
double distance_traveled_m =
(velocity_old_mps + velocity_new_mps)/2 * (time_new_sec - time_old_sec);
distance_traveled_m_total += distance_traveled_m;
}
See also: https://en.wikipedia.org/wiki/Numerical_integration.
Going further:
high-resolution timestamps
To do the above, you'll need a good way to obtain timestamps. Here are various techniques I use:
In C++, use my uint64_t nanos() function here.
If using Linux in C or C++, use my uint64_t nanos() function which uses clock_gettime() here. Even better, I have wrapped it up into a nice timinglib library for Linux, in my eRCaGuy_hello_world repo here:
timinglib.h
timinglib.c
Here is the NS_TO_SEC() macro from timing.h:
#define NS_PER_SEC (1000000000L)
/// Convert nanoseconds to seconds
#define NS_TO_SEC(ns) ((ns)/NS_PER_SEC)
If using a microcontroller, you'll need to read an incrementing periodic counter from a timer or counter register which you have configured to increment at a steady, fixed rate. Ex: on Arduino: use micros() to obtain a microsecond timestamp with 4-us resolution (by default, it can be changed). On STM32 or others, you'll need to configure your own timer/counter.
use high data sample rates
Taking data samples as fast as possible in a sample loop is a good idea, because then you can average many samples to achieve:
Reduced noise: averaging many raw samples reduces noise from the sensor.
Higher-resolution: averaging many raw samples actually adds bits of resolution in your measurement system. This is known as oversampling.
I write about it on my personal website here: ElectricRCAircraftGuy.com: Using the Arduino Uno’s built-in 10-bit to 16+-bit ADC (Analog to Digital Converter).
And Atmel/Microchip wrote about it in their white-paper here: Application Note AN8003: AVR121: Enhancing ADC resolution by oversampling.
Taking 4^n samples increases your sample resolution by n bits of resolution. For example:
4^0 = 1 sample at 10-bits resolution --> 1 10-bit sample
4^1 = 4 samples at 10-bits resolution --> 1 11-bit sample
4^2 = 16 samples at 10-bits resolution --> 1 12-bit sample
4^3 = 64 samples at 10-bits resolution --> 1 13-bit sample
4^4 = 256 samples at 10-bits resolution --> 1 14-bit sample
4^5 = 1024 samples at 10-bits resolution --> 1 15-bit sample
4^6 = 4096 samples at 10-bits resolution --> 1 16-bit sample
See:
So, sampling at high sample rates is good. You can do basic filtering on these samples.
If you process raw samples at a high rate, doing numerical derivation on high-sample-rate raw samples will end up derivating a lot of noise, which produces noisy derivative estimates. This isn't great. It's better to do the derivation on filtered samples: ex: the average of 100 or 1000 rapid samples. Doing numerical integration on high-sample-rate raw samples, however, is fine, because as Edgar Bonet says, "when integrating, the more samples you get, the better the noise averages out." This goes along with my notes above.
Just using the filtered samples for both numerical integration and numerical derivation, however, is just fine.
use reasonable control loop rates
Control loop rates should not be too fast. The higher the sample rates, the better, because you can filter them to reduce noise. The higher the control loop rate, however, not necessarily the better, because there is a sweet spot in control loop rates. If your control loop rate is too slow, the system will have a slow frequency response and won't respond to the environment fast enough, and if the control loop rate is too fast, it ends up just responding to sample noise instead of to real changes in the measured data.
Therefore, even if you have a sample rate of 1 kHz, for instance, to oversample and filter the data, control loops that fast are not needed, as the noise from readings of real sensors over very small time intervals will be too large. Use a control loop anywhere from 10 Hz ~ 100 Hz, perhaps up to 400+ Hz for simple systems with clean data. In some scenarios you can go faster, but 50 Hz is very common in control systems. The more-complicated the system and/or the more-noisy the sensor measurements, generally, the slower the control loop must be, down to about 1~10 Hz or so. Self-driving cars, for instance, which are very complicated, frequently operate at control loops of only 10 Hz.
loop timing and multi-tasking
In order to accomplish the above, independent measurement and filtering loops, and control loops, you'll need a means of performing precise and efficient loop timing and multi-tasking.
If needing to do precise, repetitive loops in Linux in C or C++, use the sleep_until_ns() function from my timinglib above. I have a demo of my sleep_until_us() function in-use in Linux to obtain repetitive loops as fast as 1 KHz to 100 kHz here.
If using bare-metal (no operating system) on a microcontroller as your compute platform, use timestamp-based cooperative multitasking to perform your control loop and other loops such as measurements loops, as required. See my detailed answer here: How to do high-resolution, timestamp-based, non-blocking, single-threaded cooperative multi-tasking.
full, numerical integration and multi-tasking example
I have an in-depth example of both numerical integration and cooperative multitasking on a bare-metal system using my CREATE_TASK_TIMER() macro in my Full coulomb counter example in code. That's a great demo to study, in my opinion.
Kalman filters
For robust measurements, you'll probably need a Kalman filter, perhaps an "unscented Kalman Filter," or UKF, because apparently they are "unscented" because they "don't stink."
See also
My answer on Physics-based controls, and control systems: the many layers of control

How to sample a value n from one distribution and take n values from another distribution

Disclaimer: I have no idea what I'm doing, in either pymc3 or Bayesian statistics, but I'm trying to learn by diving straight in...
What I think I want to do is:
I have one distribution, A, representing a frequency of events
I have another distribution, B, representing a "cost" of each event
I want to use a sampled frequency, n, from A and sample n costs from B
The result I care about from each draw is the sum of costs
So each draw I would have a variable number of values from B.
My naive attempt to code this was:
with pm.Model() as model:
events = pm.Poisson("events", mu=5)
count = events.random()
losses = [
pm.Lognormal(f"loss[{i}]", 1.2, 0.2)
for i in range(count)
]
year_loss = pm.Deterministic("year_loss", pm.math.sum(losses))
trace = pm.sample(draws=500, chains=1)
total_loss = trace.get_values("year_loss").sum()
print(f"total_loss: {total_loss}")
mean_loss = total_loss / 500
print(f"mean_loss: {mean_loss}")
This "sort of works" in that I can get some output like:
Sequential sampling (1 chains in 1 job)
CompoundStep
>Metropolis: [events]
>NUTS: [loss[4], loss[3], loss[2], loss[1], loss[0]]
Sampling 1 chain for 1_000 tune and 500 draw iterations (1_000 + 500 draws total) took 2 seconds.02<00:00 Sampling chain 0, 0 divergences]
Only one chain was sampled, this makes it impossible to run some convergence checks
total_loss: 8431.667563769252
mean_loss: 16.8633351275385
But what is really happening is that count only has a single value across all draws in the trace.
Instead I wanted it to have an independent value for each draw.
Can someone kindly point me in the right direction?

SAS Sample Size Estimation

The seeds of the garden pea are either yellow or green. A certain cross between pea plants produces progeny where 75% are plants with yellow seeds and 25% are plants with green seeds. What is the minimum number of progeny you would need to grow to have probability no less than 0.99 of obtaining at least 10 plants with green seeds?
I understand how to estimate a required sample size when I have data such as standard deviation, mean, correlation, etc., but I don't even know where to start to estimate it based on the percentage values with a certain probability.
So far I set up this code in SAS:
Proc power;
onesamplefreq test=Z method=normal
sides=1
alpha=.01
nullproportion=.5
proportion=.25
power=.99
ntotal= .;
run;
Running this program resulted in a sample size of 76, but I don't feel like this is correct. I don't know how to specify that I need at least 10 plants with green seeds, and I don't know how to set the nullproportion or if it matters.
It is a Binomial distribution kind problem. Where chance of winning (green plant) is 25%. You want to win at least 10 times, so how many times you need to play (that is, how many seeds you need)?
Mean of binomial distribution will answer this question which is:
np = 10
n*0.25 = 10
n = 40
So required seed is 40. This is purely probabilistic. But we need to consider Type I and Type II error. So sample size 76 seems reasonable to me.

How to find why a RBM does not work correctly?

I'm trying to implement a RBM and I'm testing it on MNIST dataset. However, it does not seems to converge.
I've 28x28 visible units and 100 hidden units. I'm using mini-batches of size 50. For each epoch, I traverse the whole dataset. I've a learning rate of 0.01 and a momentum of 0.5. The weights are randomly generated based on a Gaussian distribution of mean 0.0 and stdev of 0.01. The visible and hidden biases are initialized to 0. I'm using a logistic sigmoid function as activation.
After each epoch, I compute the average reconstruction error of all mini-batches, here are the errors I get:
epoch 0: Reconstruction error average: 0.0481795
epoch 1: Reconstruction error average: 0.0350295
epoch 2: Reconstruction error average: 0.0324191
epoch 3: Reconstruction error average: 0.0309714
epoch 4: Reconstruction error average: 0.0300068
I plotted the histograms of the weights to check (left to right: hiddens, weights, visibles. top: weights, bottom: updates):
Histogram of the weights after epoch 3
Histogram of the weights after epoch 3 http://baptiste-wicht.com/static/finals/histogram_epoch_3.png
Histogram of the weights after epoch 4
Histogram of the weights after epoch 4 http://baptiste-wicht.com/static/finals/histogram_epoch_4.png
but, except for the hidden biases that seem a bit weird, the remaining seems OK.
I also tried to plot the hidden weights:
Weights after epoch 3
Weights after epoch 3 http://baptiste-wicht.com/static/finals/hiddens_weights_epoch_3.png
Weights after epoch 4
Weights after epoch 4 http://baptiste-wicht.com/static/finals/hiddens_weights_epoch_4.png
(they are plotted in two colors using that function:
static_cast<size_t>(value > 0 ? (static_cast<size_t>(value * 255.0) << 8) : (static_cast<size_t>(-value * 255.)0) << 16) << " ";
)
And here, they do not make sense at all...
If I go further, the reconstruction error falls a bit more, but do no go further than 0.025. Even if I change the momentum after sometime, it goes higher and then goes down a bit but not interestingly. Moreover, the weights do no make more sense after more epochs. In most example implementations I've seen, the weights were making some sense after iterating through the complete data set two or three times.
I've also tried to reconstruct an image from the visible units, but the results seems almost random.
What could I do to check what goes wrong in my implementation ? Should the weights be within some range ? Does something seems really strange in the data ?
Complete code: https://github.com/wichtounet/dbn/blob/master/include/rbm.hpp
You are using a very small learning rate. In most NNs trained by SGD you start out with a higher learning rate and decay it over time. Search for learning rate or adaptive learning rate to find more information on that.
Second, when implementing a new algorithm I would recommend finding the paper that introduced it and reproducing their results. A good paper should include most of the settings used - or the method used to determine the settings.
If a paper is unavailable, or it was tested on a data set you don't have access to - go find a working implementation and compare the outputs when using the same settings. If the implementations are not feature compatible, turn off as many features as you can that are not shared.

Measure variation of data points from a line; To Catch a Dip

How can I measure this area in C++?
(update: I posted the solution and code as an answer rather than edit the question again)
The ideal line (dashed red) is the plot from starting point with the average rise added with each angle of measurement; this I obtain via average. I measured the test data in black. How can I quantify the area of the dip in blue? X-axis is unitized, so slopes and math are simplified.
I could determine a cutoff for the size of areas like this and then flag this part for retesting or failure. Rarely, there is another dip that appears closer to the right, but setting a cutoff value for standard deviation usually fails those parts.
Update
Diego's answer helped me visualize this. Now that I can see what I'm trying to do, I'll work on the algorithm to implement the "homemade dip detector". :)
Why?
I created a test bench to test throttle position sensors I'm selling. I'm trying to programatically quantify how straight the plot is by analyzing the data collected. This one particular model is vexing me.
Sample plot of a part I prefer not to sell:
The X axis are evenly spaced angles of throttle opening. The stepper motor turns the input shaft, stopping every 0.75° to measure the output on a 10 bit ADC, which gets translated to the Y axis. The plot is the translation of data[idx] to idx,value mapped to (x,y) bitmap coordinates. Then I draw lines between the points within the bitmap using Bresenham's algorithm.
My other TPS products produce amazingly linear output.
The lower (left) portion of the plot is crucial to normal usage of any motor vehicle; it's when you're driving around town, entering parking lots, etc. This particular part has a tendency to develop a dip around 15° opening and I wish to use the program to quantify this "dip" in the curve and rely less upon the tester's intuition. In the above example, the plot dips but doesn't return to what an ideal line might be.
Even though this is an embedded application, printing the report takes 10 seconds, thus I do not consider stepping through an array of 120 points of data multiple times a waste of cycles. Also, since I'm using a uC32 PIC32 microcontroller, there's plenty of memory, so I have the luxury of being able to ponder this problem within the controller.
What I'm trying already
Array of rise between test points: I dismiss the X-axis entirely, considering it unitized, and then make an array of change from one reading to the next. This array is what contributes to the report's "Min rise between points: 0 Max: 14". I call this array deltas.
I've tried using standard deviation on deltas, however, during testing I have found that a low Std Dev is not a reliable measure for this part. If the dip quickly returns to the original line implied by early data points, the Std Dev can be deceptively low (observed to be as low as 2.3) but the part is still something I wouldn't want to use. I tried setting a cutoff at 2.6, but it failed too many parts with great plots. The other, more linear part linked to above can reliably count on Std Dev for quality.
Kurtosis seems not to apply for this situation at all. I learned of Kurtosis today and found a Statistics Library which includes Kurtosis and Skewness. During continued testing, I found that of these two measures, there was not a trend of positive, negative, or amplitude which would correspond to either passing or failing. That same gentleman has shared a linear regression library, but I believe Lin Reg is unrelated to my situation, as I am comfortable with the assumption of the AVG of deltas being my ideal line. Linear Regression and R^2 are more for finding a line from less ideal data or much larger sets.
Comparing each delta to AVG and Std Dev I set up a monitor to check each delta against final average of the deltas's data. Here, too, I couldn't find a reliable metric. Too many good parts would not pass a test restricting any delta to within 2x Std Dev away from the Average. Ultimately, the only variation from AVG I could settle on is to be within AVG+Std Dev difference from the AVG itself. Anything more restrictive would fail otherwise good parts. And the elusive dip around 15° opening can sneak through this test.
Homemade dip detector When feeding deltas to the serial monitor of the computer, I observed consecutive negative deltas during the dip, so I programmed in a dip detector, but it feels very crude to me. If there are 5 or more negative deltas in a row, I sum them. I have seen that if I take that sum the dip's differences from AVG then divide by the number of negative deltas, a value over 2.9 or 3 could mean a fail. I have observed dips lasting from 6 to 15 deltas. Readily observable dips would have their differences from AVG sum up to -35.
Trending accumulated variation from the AVG The above made me think watching the summation of deltas as it wanders away from AVG could be the answer. Meaning, I step through the array and sum the differences of each delta from AVG. I thought I was on to something until a good part blew this theory. I was seeing a trend of the fewer times the running sum varied from AVG by less than 2x AVG, the more straight the line appeared. Many ideal parts would only show 8 or less delta points where the sumOfDiffs would stray from the AVG very far.
float sumOfDiffs=0.0;
for( int idx=0; idx<stop; idx++ ){
float spread = deltas[idx] - line->AdcAvgRise;
sumOfDiffs = sumOfDiffs + spread;
...
testVal = 2*line->AdcAvgRise;
if( sumOfDiffs > testVal || sumOfDiffs < -testVal ){
flag = 'S';
}
...
}
And then a part with a fantastic linear plot came through with 58 data points where sumOfDiffs was more than twice the AVG! I find this amazing, as at the end of the ~120 data points, sumOfDiffs value is -0.000057.
During testing, the final sumOfDiffs result would often register as 0.000000 and only on exceptionally bad parts would it be greater than .000100. I found this quite surprising, actually: how a "bad part" can have accumulated great accuracy.
Sample output from monitoring sumOfDiffs This below output shows a dip happening. The test watches as the running sumOfDiffs is more than 2x the AVG away from the AVG for the whole test. This dip lasts from deltas idx of 23 through 49; starts at 17.25° and lasts for 19.5°.
Avg rise: 6.75 Std dev: 2.577
idx: delta diff from avg sumOfDiffs Flag
23: 5 -1.75 -14.05 S
24: 6 -0.75 -14.80 S
25: 7 0.25 -14.55 S
26: 5 -1.75 -16.30 S
27: 3 -3.75 -20.06 S
28: 3 -3.75 -23.81 S
29: 7 0.25 -23.56 S
30: 4 -2.75 -26.31 S
31: 2 -4.75 -31.06 S
32: 8 1.25 -29.82 S
33: 6 -0.75 -30.57 S
34: 9 2.25 -28.32 S
35: 8 1.25 -27.07 S
36: 5 -1.75 -28.82 S
37: 15 8.25 -20.58 S
38: 7 0.25 -20.33 S
39: 5 -1.75 -22.08 S
40: 9 2.25 -19.83 S
41: 10 3.25 -16.58 S
42: 9 2.25 -14.34 S
43: 3 -3.75 -18.09 S
44: 6 -0.75 -18.84 S
45: 11 4.25 -14.59 S
47: 3 -3.75 -16.10 S
48: 8 1.25 -14.85 S
49: 8 1.25 -13.60 S
Final Sum of diffs: 0.000030
RunningStats analysis:
NumDataValues= 125
Mean= 6.752
StandardDeviation= 2.577
Skewness= 0.251
Kurtosis= -0.277
Sobering note about quality: what started me on this journey was learning how major automotive OEM suppliers consider a 4 point test to be the standard measure for these parts. My first test bench used an Arduino with 8k of RAM, didn't have a TFT display nor a printer, and a mechanical resolution of only 3°! Back then I simply tested deltas being within arbitrary total bounds and choosing a limit of how big any single delta could be. My 120+ point test feels high class compared to that 30 point test from before, but that test had no idea about these dips.
Premises
the mean of a set of data has the mathematical property that the sum of the deviations from the mean is 0.
this explains why both bad and good datasets alwais give almost 0.
basically the result when differs from zero is essentially an accumulations of rounding errors in the diffs and that's why unfortunately cannot hold useful informations
the thing that most clearly define what you're looking for is your image: you're looking for an AREA and this is why you're not finding the solution in this ways:
looking to a metric in the single points is too local to extract that information
looking to global accumulations or parameters (global standard deviation) is too global and you lose the data among too much information and source of variations
kurtosis (you've already told I know but is for completeness) is out of its field of applications since this is not a probability distribution
in the end the more suitable approach of your already tryied ones is the "Homemade dip detector" because thinks in a way that is local but not too much.
Last but not least:
Any Algorithm you're going to choose has its tacit points on which it stands.
So maybe one is looking for a super clever algorithm that with no parametrization and tuning automatically adapts to the problem and self define thereshods and other.
On the other side there is an algorithm that will stand on the knowledge by the writer of the tipical data behavior (good and bad) and that is itself stupid in the way that if there is another different and unespected behavior the results are unpredictable
Ok, the right way is one of this two or is in-between them depending on the application. So if it works also the "Homemade dip detectors" can be a solution. There is not reason to define it crude but it could be that is not sufficient based on applicaton needs and that's an other thing.
How to find the area
Once you have the data the first thing is to clearly define the "theoretical straight line". I give some options:
use RANSAC algorithm (formally the best option IMHO)
this give you the best fit to the aligned points disregarding the not aligned ones
it is quite difficult and maybe oversized for this work (IMHO)
consider the line defined by the first and last point
you told that the dip is almost always in the same position that is not near boundaries so first and last points can be thought as affordable
very easy to implement
this is an example of using the knowledge about expected behaviors as I told before so you need to think if and how much confidence you give to this assumption
consider a linear fit to the first 10 points and last 10 points
is only a more affordable version of previous since using more points you can be less worried that maybe just the first point or the last were affected by any measure problem and so all fails because of this
also quite easy to implement
if I were you I will use this or something inspired to this
calculate the Y value given by the straight line for each X
calculate the area between the two curves (or the areas under the function Y_dev = Y_data - Y_straight that is mathematically the same) with this procedure:
PositiveMax = 0; NegativeMax = 0;
start from first point (value can be positive or negative) and put in a temporary area accumulator tmp_Area
for each next point
if the sign is the same then accumulate the value
if it is different
stop accumulating
check if the accumulated value is the greater than PositiveMax or below NegativeMax and if it is than store as new PositiveMax or NegativeMax
in any case reset the accumulator with tmp_Area = Y_dev; to the current value starting this way a new accumulation
in the end you will have the values of the maximum overvalued contiguous area and maximum undervalued contiguous area that I think are the scores you're looking for.
if you want you can only manage the NegativeMax based on observed and expected data behaviors
you may find useful to put a thereshold so that if a value Y_dev is lower than the thereshold you do not accumulate it.
this in order to not obtain large accumulations from many points close to the straight line that can be similar to the accumulations of few points far from the line
the need of this and and the proper thereshold needs to be evaluated on some sample data
you need to find an appropriate thereshold for this contiguous area and you can have it only from observation of sample data.
again: it can be you observing and deciding the thereshold or you can build a repository of good and bad samples and write a program that automatically learn which thereshold to use. But his is not the algorithm, this is how to find its operative parameters and there is nothing wrong to do by human brain.. ..it only depends if we're looking for a method to separate bad and good things or if we're looking for and autoadaptive algorithm that does this.. ..you decide the target.
It turns out the result of my gut feeling and Diego's method is an average of the integral. I still don't like that name, so I have described the algorithm and have asked on Math.SE what to call this, which got migrated to "Cross Validated", Stats.SE .
I Updated graphs after a massive edit of my Math.SE question. It turns out I'm taking the average of a closed integral of the derivative of the data. :P First, we gather the data:
Next is the "derivative": step through the original data array to form the deltas array which is the rise of ADC values from one 0.75° step to the next. "Rise" or "slope" is what the derivative is: dy/dx.
With the "slope" or average leveled out, I can find multiple negative deltas in a row, sum them, then divide by the count at the end of the dip. The sum is an integral of the area between average and the deltas and when the dip goes back positive, I can divide the sum by the count of the dips.
During testing, I came up with a cutoff value for this average of the integral at 2.6. That was a great measure of my "gut instinct" looking at the plot thinking a part was good or bad.
In case someone else finds themselves trying to quantify this, here's the code I implemented. Note that it is only looking for negative dips. Also, dipCountLimit is defined elsewhere as 5. In addition to the dip detector/accumulator (ie Numerical Integrator) I also have a spike detector that arbitrarily flags the test as bad if any data points stray from the average by the amount of average + standard deviation. AVG+STD DEV as a spike limit was chosen arbitrarily based on the observed plots of the parts it would fail.
int dipdx=0;
// inDipFlag also counts the length of this dip
int inDipFlag=0;
float dips[140] = { 0.0 };
for( int idx=0; idx<stop; idx++ ){
const float diffFromAvg = deltas[idx] - line->AdcAvgRise;
// state machine to monitor dips
const int _stop = stop-1;
if( diffFromAvg < 0 && idx < _stop ) {
// check NEXT data point for negative diff & set dipFlag to put state in dip
const float nextDiff = deltas[idx+1] - line->AdcAvgRise;
if( nextDiff < 0 && inDipFlag == 0 )
inDipFlag = 1;
// already IN a dip, and next diff is negative
if( nextDiff < 0 && inDipFlag > 0 ) {
inDipFlag++;
}
// accumulate this dip
dips[dipdx]+= diffFromAvg;
// next data point ends this dip and we advance dipdx to next dip
if( inDipFlag > 0 && nextDiff > 0 ) {
if( inDipFlag < dipCountLimit ){
// reset the accumulator, do not advance dipdx to next entry
dips[dipdx]=0.0;
} else {
// change this entry's value from dip sum to its ratio
dips[dipdx] = -dips[dipdx]/inDipFlag;
// advance dipdx to next entry
dipdx++;
}
// Next diff isn't negative, so the dip is done
inDipFlag = 0;
}
}
}