Fast percentile in C++ - speed more important than precision - c++

This is a follow-up to Fast percentile in C++
I have a sorted array of 365 daily cashflows (xDailyCashflowsDistro) which I randomly sample 365 times to get a generated yearly cashflow. Generating is carried out by
1/ picking a random probability in the [0,1] interval
2/ converting this probability to an index in the [0,364] interval
3/ determining what daily cashflow corresponds to this probability by using the index and some linear approximation.
and summing 365 generated daily cashflows. Following the previously mentioned thread, my code precalculates the differences of sorted daily cashflows (xDailyCashflowDiffs) where
xDailyCashflowDiffs[i] = xDailyCashflowsDistro[i+1] - xDailyCashflowsDistro[i]
and thus the whole code looks like
double _dIdxConverter = ((double)(365 - 1)) / (double)(RAND_MAX - 1);
for ( unsigned int xIdx = 0; xIdx < _xCount; xIdx++ )
{
double generatedVal = 0.0;
for ( unsigned int xDayIdx = 0; xDayIdx < 365; xDayIdx ++ )
{
double dIdx = (double)fastRand()* _dIdxConverter;
long iIdx1 = (unsigned long)dIdx;
double dFloor = (double)iIdx1;
generatedVal += xDailyCashflowsDistro[iIdx1] + xDailyCashflowDiffs[iIdx1] *(dIdx - dFloor);
}
results.push_back(generatedVal) ;
}
_xCount (the number of simulations) is 1K+, usually 10K.
The problem:
This simulation is being carried out 15M times (compared to 100K when the first thread was written) at the moment, and it takes ~10 minutes on a 3.4GHz machine. Due to the nature of problem, this 15M is unlikely to be significantly lowered in the future, only increased. Having used VTune Analyzer, I am being told that the last but one line (generatedVal += ...) generates 80% of runtime. And my question is why and how I can work with that.
Things I have tried:
1/ getting rid of the (dIdx - dFloor) part to see whether double difference and multiplication is the main culprit - runtime dropped by a couple of percent
2/ declaring xDailyCashflowsDistro and xDailyCashflowDiffs as __restrict so as to prevent the compiler thinking they are dependent on each other - no change
3/ tried using 16 days (as opposed to 365) to see whether it is cache misses that drag my performance - not a slight change
4/ tried using floats as opposed to doubles - no change
5/ compiling with different /fp: - no change
6/ compiling as x64 - has effect on the double <-> ulong conversions, but the line in question is unaffected
What I am willing to sacrifice is resolution - I do not care whether the generatedVal is 100010.1 or 100020.0 at the end if the speed gain is substantial.
EDIT:
The daily/yearly cashflows are related to the whole portfolio. I could divide all daily cashflows by portfolio size and would thus (at 99.99% confidence level) ensure that daily cashflows/pflio_size will not reach out of the [-1000,+1000] interval. In this case, though, I would need precision to the hundredths.

Perhaps you could turn your piecewise linear function into a piecewise-linear "histogram" of its values. The number you're sampling appears to be the sum of 365 samples from that histogram. What you're doing is a not-particularly-fast way to sample from the sum of 365 samples from that histogram.
You might try computing a Fourier (or wavelet or similar) transform, keeping only the first few terms, raising it to the 365th power, and computing the inverse transform. You won't get a probability distribution in the end, but there shouldn't be "too much" mass below 0 or above 1 and the total mass shouldn't be "too different" from 1 with this technique. (I don't know what your data looks like; this technique may well be unworkable for good mathematical reasons.)
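The "raise it to the 365th power" step rests on the convolution theorem. As a sketch, assuming the 365 daily cashflows X_1, ..., X_365 are independent draws from one density f (this notation is introduced here, it is not from the question), the density of the yearly sum S is a 365-fold convolution, which becomes a simple power in the transform domain:

```latex
S = \sum_{k=1}^{365} X_k, \qquad
f_S = \underbrace{f * f * \cdots * f}_{365\ \text{copies of } f}, \qquad
\hat{f}_S(\omega) = \bigl(\hat{f}(\omega)\bigr)^{365}
```

Truncating the transform to its first few terms before taking the power is what makes the result only approximately a probability distribution.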


c++ Create time remaining estimate when data calcs get progressively longer?

I'm adding items to a list, so each insert takes just a bit longer than the last (this is a requirement, assume you can't change that). I've manually timed a sample dataset on MY computer but I want a generalized way to predict the time on any computer, and given ANY dataset size.
In my flailing around trying to figure this out, what i have collected is a vector, 100 long, of "how long 1/100th the sample data" took. So in my example data set i have 237,965 objects, which means in the vector of times i collected, each bucket tells how long it took to add 2,379 items.
Here's a link to the sample data of 100 items. So you can see the first 2k items took about 8 seconds, and the last 2k items took about 101 seconds. All together, if you add all the time, that's 4,295 seconds or about 1 hr 11 minutes.
So my question is, given this data set, and using it for future predictions, how do i estimate the remaining time when adding different size data?
In more flailing, i made some plots, wondering if it could help. First plot is just the raw data on a log graph:
I then made a 2nd data set based on first, this time showing accumulated time, rather than just the time for the current slice, and plotted that on a linear graph:
Notice the lovely trend line formula? That MUST be something that i just need to somehow plug into my code but i can't for the life of me figure out how.
Should i have instead gathered the data into time-slices and not index-slices? ie: i KNOW this data takes 1:10 to load, so take snapshots every 1/100th of that duration, instead of snapshotting every 1/100th of the data set?
Or HOW do i figure this out?
the function I need to write has this API:
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT);
so given only those three variables (maxI, curI, and elapsedT), and knowing the trend line formula from above, i need to return "duration until maxI" (seconds).
Any ideas?
Update:
well it seems after much futzing around, i can just do this (note "LERP" is just linear interpolate):
#define kDataSetMax 237965
double FunctionX(int in_x)
{
double _x(LERP(0, 100, in_x, 0, kDataSetMax));
double resultF =
(0.32031139888898874 * math_square(_x))
+ (9.609731568497784 * _x)
- (7.527252350031663);
if (resultF <= 1) {
resultF = 1;
}
return resultF;
}
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT)
{
CFTimeInterval endT(FunctionX(maxI));
return endT;
}
But that means i'm just ignoring curI and elapsedT?? That doesn't seem... right? What am I missing?
Footnotes:
#define LERP(to_min, to_max, from, from_min, from_max) \
((from_max) == (from_min) ? from : \
(double)(to_min) + ((double)((to_max) - (to_min)) \
* ((double)((from) - (from_min)) \
/ (double)((from_max) - (from_min)))))
#define LERP_PERCENT(from, from_max) \
LERP(0.0f, 1.0f, from, 0.0f, from_max)
Your FunctionX is most of the way there. It currently calculates expectedTimeToReachMaxIOnMyMachine. What you still need is to work out how much slower (or faster) the current run is than your reference machine was at the same point, and then apply that same ratio to the expected time at the maximum.
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT) {
//calculate how long we expected it to take to reach this point
CFTimeInterval expectedTimeToReachCurrentIOnMyMachine = FunctionX(curI);
//calculate how much slower we are than the expectation
//if this machine is faster, the math still works out.
double slowerThanExpectedByRatio
= double(elapsedT) / expectedTimeToReachCurrentIOnMyMachine;
//calculate how long we expected to reach the max
CFTimeInterval expectedTimeToReachMaxIOnMyMachine = FunctionX(maxI);
//if we continue to be the same amount slower, we'll reach the max at:
CFTimeInterval estimatedTimeToReachMaxI
= expectedTimeToReachMaxIOnMyMachine * slowerThanExpectedByRatio;
return estimatedTimeToReachMaxI;
}
Note that a smart implementation can cache and reuse expectedTimeToReachMaxIOnMyMachine and not calculate it every time.
Basically this assumes that after doing X% of the work, we can calculate how much slower we were than the expected curve, and assume we will stay approximately that same amount slower than the expected curve.
In the example below, the expected time taken is the blue line. At 4000 elements, we see that the expected time on your machine was 8,055,826, but the actual time taken on this machine was 10,472,573, which is 30% higher (slowerThanExpectedByRatio=1.3). At that point, we can extrapolate that we'll probably remain 30% higher throughout the entire process (the purple line). So if the total expected time on your machine for 10000 elements was 32,127,229, then our total estimated time on this machine for 10000 will be 41,765,398 (30% higher).
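The API in the question asks for the duration until maxI rather than the total, so one way to finish the job is to subtract the time already spent. A minimal sketch, assuming a hypothetical reference curve expected_seconds (a stand-in for FunctionX, with made-up coefficients) and plain double in place of CFTimeInterval:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical reference curve: seconds the reference machine needed to
// process the first `items` items. Stand-in for FunctionX; the coefficients
// are illustrative only, not fitted to the question's data.
double expected_seconds(int items) {
    double x = static_cast<double>(items);
    double t = 0.25 * x * x + 0.5 * x;
    return std::max(t, 1.0);  // clamp like FunctionX, so we never divide by zero
}

// Estimated seconds still needed to get from cur_i to max_i, assuming this
// machine stays slower (or faster) than the reference by a constant ratio.
double estimate_remaining_seconds(int max_i, int cur_i, double elapsed) {
    assert(cur_i >= 0 && cur_i <= max_i);
    double ratio = elapsed / expected_seconds(cur_i);  // 1.3 means 30% slower
    double total = expected_seconds(max_i) * ratio;    // scaled total time
    return std::max(total - elapsed, 0.0);             // never report negative
}
```

With this toy curve, estimate_remaining_seconds(4, 2, 3.0) scales the expected total of 6 seconds by 1.5 and subtracts the 3 seconds already spent.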

Why is there a difference between the sum (stime + utime) of all processes' CPU usage, compared to the overall CPU usage from /proc/stat in Linux?

I need to calculate the overall CPU usage of my Linux device over some time (1-5 seconds) and a list of processes with their respective CPU usage times. The program should be designed and implemented in C++. My assumption would be that the sum of all process CPU times would be equal to the total value for the whole CPU. For now the CPU I am using is multi-cored (2 cores).
According to How to determine CPU and memory consumption from inside a process? it is possible to calculate all "jiffies" available in the system since startup using the values for "cpu" in /proc/stat. If you now sample the values at two points in time and compare the values for user, nice, system and idle at the two time points, you can calculate the average CPU usage in this interval. The formula would be
totalCPUUsage = ((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
According to How to calculate the CPU usage of a process by PID in Linux from C? the used jiffies for a single process can be calculated by adding utime and stime from /proc/${PID}/stat (column 14 and 15 in this file). When I now calculate this sum and divide it by the total amount of jiffies in the analyzed interval, I would assume the formula for one process to be
processCPUUsage = ((process_utime_aft - process_utime_bef) + (process_stime_aft - process_stime_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
When I now sum up the values for all processes and compare it to the overall calculated CPU usage, I receive a slightly higher value for the aggregated value most of the time (although the values are quite close for all different CPU loads).
Can anyone explain to me what the reason for that is? Are there any CPU resources that are used by more than one process and thus accounted twice or more in my accumulation? Or am I simply missing something here? I cannot find any further hint in the Linux man page for the proc file system (https://linux.die.net/man/5/proc) either.
Thanks in advance!
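For reference, the two formulas above can be sketched in C++ as follows. This is only an illustration of the arithmetic in the question; the names CpuSample, parse_cpu_line, interval_jiffies, total_cpu_usage and process_cpu_usage are hypothetical, and the lines are passed in as strings rather than read from /proc. Note that the "cpu" line carries further columns (iowait, irq, softirq and others, per proc(5)) which this formula leaves out of the denominator; that may be one contributor to the mismatch.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// First four jiffy counters from the aggregate "cpu" line of /proc/stat.
struct CpuSample {
    unsigned long long user, nice, system, idle;
};

// Parse a line such as "cpu  100 0 50 850 ..."; returns false on mismatch.
bool parse_cpu_line(const std::string& line, CpuSample& out) {
    std::istringstream in(line);
    std::string label;
    return (in >> label >> out.user >> out.nice >> out.system >> out.idle)
           && label == "cpu";
}

// Denominator used in the question: user + nice + system + idle deltas.
double interval_jiffies(const CpuSample& bef, const CpuSample& aft) {
    return double(aft.user - bef.user) + double(aft.nice - bef.nice)
         + double(aft.system - bef.system) + double(aft.idle - bef.idle);
}

// Overall usage in percent between two samples.
double total_cpu_usage(const CpuSample& bef, const CpuSample& aft) {
    double all = interval_jiffies(bef, aft);
    double busy = all - double(aft.idle - bef.idle);
    return all > 0.0 ? 100.0 * busy / all : 0.0;
}

// Usage of one process in percent; ticks = utime + stime from /proc/PID/stat.
double process_cpu_usage(unsigned long long ticks_bef,
                         unsigned long long ticks_aft, double jiffies) {
    assert(jiffies > 0.0 && ticks_aft >= ticks_bef);
    return 100.0 * double(ticks_aft - ticks_bef) / jiffies;
}
```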

How would you implement this adaptive 'fudge factor' in a scheduler?

I have a scheduler, endlessly executing n actions. Each action is scheduled for x seconds into the future. When an action completes, it is re-scheduled for another x seconds into the future after its previously scheduled time. Every 1s, the scheduler "ticks", executing at most 25 actions which are due to fire. Actions may take a second or so to complete (though this value should be considered variable and unpredictable).
Say that x is 60 seconds. Due to the throttling of at most 25 actions being executed simultaneously, when n grows large, it is conceivable that the scheduler won't have time to execute all n actions within a 60 second window, and actions will be executed later and later as time goes on. This is undesirable, as it'll become true that there are actions to execute on every single tick and this increases load on my system. It's less important to me to keep x exactly constant than it is to keep load down.
So I wish to implement an adaptive "handicap", an automatically-applied fudge factor h, increasing it when a majority of actions are executed "late", and decreasing it (edging it back to its default of zero) when they're all seemingly and consistently on time. The scheduler would then be made to schedule actions for x+h seconds' time, rather than x.
At a high level, how would you approach this? How would you define "a majority of actions are executed 'late'" and how would you represent/detect it in C++03 code?
Better yet, is there an existing well-known approach that objectively "works" here?
To be clear, you are aiming to avoid sustained high load where there are tasks
every tick, rather than aiming to minimise the scheduling delay.
Correspondingly, the metric you should be looking at when considering the fudge
factor is the load, not the lateness.
If you have full knowledge of the system — the number of tasks, their
rescheduling intervals, the distribution of their execution time —
you could in principle exactly solve for a handicap value that would give you
a mean target load when busy, or would, say, only exceed the target load
10% of the time when busy, or so on.
On the other hand, if this information is not available or predictable,
you will need an adaptive approach.
The general theory for this sort of thing is control theory, which can get
quite involved. Broadly though the heuristic is: if the load is less than the
threshold, and we have a positive handicap, reduce the handicap; if the load is
over the threshold, increase the handicap.
The handicap should be proportional, rather than additive: if, for example,
we knew we were consistently 10% overloaded, then we'd be right on target if we
applied a proportional delay of 10% on the scheduling of jobs. That is, we're
looking to apply a handicap factor h such that jobs are scheduled x*h
seconds ahead instead of x. A factor of 1 would correspond to no handicap.
When we're overloaded, but not maximally overloaded, the response then is linear
in the log: log(h) = log(load) - log(load_target). So the simplest method
would be:
load = get_current_load();
if (load>load_target) h = load/load_target;
else h = 1.0;
Unfortunately, there is a maximum measured load, and linearity breaks down
here. The linear model can be extended to incorporate the accumulated
deviation from the target load, and the rate of change of the load.
This corresponds to the proportional-integral-derivative controller.
As this is a noisy environment (there is variation in the action
execution times), it might be wise to shy away from the derivative bit
of this model, and stick with the proportional-integral (PI) part.
When this model is discretized, we get an expression for log(h)
that is proportional to the current (log) overload, plus a term that
captures how badly we've been doing:
load = get_current_load();
deviation = load > load_target ? log(load/load_target) : 0;
accum += p1 * deviation;
log_h = p2 * deviation + accum;
h = log_h < 0 ? 1.0 : exp(log_h);
Except the problem is not symmetric: when we drop below
the load target, the accumulated error term stays high.
We could work around it by accumulating negative deviations
as well, but clamping the accumulated error so that it never
goes negative, so that a period of legitimately low load
doesn't give us a free pass for later:
load = get_current_load();
if (load > 0) {
deviation = log(load/load_target);
accum += p1 * deviation;
if (accum < 0) accum = 0;
if (deviation < 0) deviation = 0;
}
else {
accum = 0;
deviation = 0;
}
log_h = p2 * deviation + accum;
h = log_h < 0 ? 1.0 : exp(log_h);
The value for p2 will be somewhere (roughly) between 0.5 and 0.9,
to leave some room for the influence of the accumulated error.
A good value for p1 will probably be around 0.3 to 0.5 times
the reciprocal of the lag time, the number of steps it takes for a change
in h to present itself as a change in load. This can be estimated
by the mean rescheduling time of the actions.
You can play around with these parameters to get the sort of
response you'd like, or you can make a more faithful mathematical
model of your scheduling problem and then do maths to it!
The parameters themselves can also be modified adaptively over
time, based on the observed response to changes in load.
(Warning, I haven't actually tried these fragments in a mock scheduler!)
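For what it's worth, the last fragment could be wrapped up for C++03 roughly as below. This is a sketch only: the class and method names (HandicapController, update) are made up here, the load measurement is left to the caller, and p1 and p2 still need tuning as described above.

```cpp
#include <cassert>
#include <cmath>

// Proportional-integral handicap controller following the last fragment.
// All names are hypothetical; nothing here has been tuned against a real
// scheduler.
class HandicapController {
public:
    HandicapController(double load_target, double p1, double p2)
        : load_target_(load_target), p1_(p1), p2_(p2), accum_(0.0) {
        assert(load_target > 0.0);
    }

    // Feed the load measured on this tick. Returns the factor h >= 1 by
    // which the rescheduling interval x should be multiplied.
    double update(double load) {
        double deviation = 0.0;
        if (load > 0.0) {
            deviation = std::log(load / load_target_);
            accum_ += p1_ * deviation;              // integral term
            if (accum_ < 0.0) accum_ = 0.0;         // no credit for quiet periods
            if (deviation < 0.0) deviation = 0.0;   // proportional term: overload only
        } else {
            accum_ = 0.0;                           // idle system: reset history
        }
        double log_h = p2_ * deviation + accum_;
        return log_h < 0.0 ? 1.0 : std::exp(log_h);
    }

private:
    double load_target_;
    double p1_;
    double p2_;
    double accum_;
};
```

On each tick the scheduler would call something like h = controller.update(actions_run_this_tick) and reschedule completed actions x*h seconds ahead.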

Want to change a value continuously from min to max to min in a loop as a Sine curve

I am working on a game where I need an algorithm to vary a value in a loop. I have implemented the algorithm, but I guess it's not working the way I want. Here's what I want and what I have already implemented:
Given :
a commodity whose price I want to circulate (from min to max to min again and continuously in a loop)
I am using cocos2d-x (C++) where I have a scheduler which runs a function at a given interval say SCHEDULE_INTERVAL
MIN_PRICE and MAX_PRICE of the commodity
currentPrice
Time duration which it will take to complete one cycle (min-max-min)
Current Implementation :
SCHEDULE_INTERVAL = 0.3 (sec) (so the function is running every 0.3 secs)
counter = 0;
timeDuration = time to complete one cycle
function
{
counter++;
_amplitude = (maxPrice - minPrice)/2;
_midValue = (maxPrice + minPrice)/2;
currentPrice = _midValue + _amplitude * sin (2*PI*counter/timeDuration)
}
Why I am using a sine wave: because I want the transitions to be slow at the peaks.
Problem: for some reason it's not behaving the way I want it to.
I want currentPrice to change continuously from minPrice to maxPrice and back to minPrice over timeDuration, with the loop running every SCHEDULE_INTERVAL.
please suggest any solutions.
Thanks :)
EDIT :
what's not working in the above implementation is that the values are not changing according to the 'timeDuration' variable
If the pseudocode you posted accurately mirrors the expressions you use in real code, you probably want to change the argument of sin to this:
2 * PI * (counter * SCHEDULE_INTERVAL) / timeDuration
counter is the number of executions, while timeDuration is (I presume) the desired length in seconds.
In other words, your units don't match - it's always worthwhile to perform a dimensional analysis when formulae don't work.
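To make the units explicit, here is a small sketch that takes elapsed time in seconds rather than a tick count. The function name price_at is hypothetical, and it deliberately uses -cos instead of sin so that the cycle starts at the minimum price (with sin it would start at the midpoint); the period is the same either way.

```cpp
#include <cassert>
#include <cmath>

// Price after `elapsed_seconds`, completing one min -> max -> min cycle every
// `cycle_seconds`. Uses -cos so that the cycle starts at min_price.
double price_at(double elapsed_seconds, double cycle_seconds,
                double min_price, double max_price) {
    assert(cycle_seconds > 0.0);
    const double two_pi = 6.283185307179586;
    double amplitude = (max_price - min_price) / 2.0;
    double mid_value = (max_price + min_price) / 2.0;
    // Both times are in seconds, so the argument of cos is dimensionless.
    return mid_value - amplitude * std::cos(two_pi * elapsed_seconds / cycle_seconds);
}
```

In the scheduled callback this would be called as price_at(counter * SCHEDULE_INTERVAL, timeDuration, MIN_PRICE, MAX_PRICE).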

Selecting nodes with probability proportional to trust

Does anyone know of an algorithm or data structure relating to selecting items, with a probability of them being selected proportional to some attached value? In other words: http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Probability_proportional_to_size_sampling
The context here is a decentralized reputation system and the attached value is therefore the value of trust one user has in another. In this system all nodes either start as friends which are completely trusted or unknowns which are completely untrusted. This isn't useful by itself in a large P2P network because there will be many more nodes than you have friends and you need to know who to trust in the large group of users that aren't your direct friends, so I've implemented a dynamic trust system in which unknowns can gain trust via friend-of-a-friend relationships.
Every so often each user will select a fixed number (for the sake of speed and bandwidth) of target nodes to recalculate their trust based on how much another selected fixed number of intermediate nodes trust them. The probability of selecting a target node for recalculation will be inversely proportional to its current trust so that unknowns have a good chance of becoming better known. The intermediate nodes will be selected in the same way, except that the probability of selection of an intermediary is proportional to its current trust.
I've written up a simple solution myself but it is rather slow and I'd like to find a C++ library to handle this aspect for me. I have of course done my own search and I managed to find TRSL which I'm digging through right now. Since it seems like a fairly simple and perhaps common problem, I would expect there to be many more C++ libraries I could use for this, so I'm asking this question in the hope that someone here can shed some light on this.
This is what I'd do:
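#include <cstdlib>  // rand, RAND_MAX

// Returns an index in [0, n), chosen with probability proportional to weights[i].
int select(double *weights, int n) {
    // This step is only necessary if weights can be arbitrary
    // (we know total = 1.0 for probabilities)
    double total = 0;
    for (int i = 0; i < n; ++i) {
        total += weights[i];
    }
    // Cast RAND_MAX to double to avoid integer overflow; r is in [0, total)
    double r = (double) rand() * total / ((double) RAND_MAX + 1);
    double running = 0;
    for (int i = 0; i < n; ++i) {
        // Pick the first item whose running total exceeds r
        running += weights[i];
        if (r < running) {
            return i;
        }
    }
    // Only reachable through floating-point rounding (or all-zero weights)
    return n - 1;
}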
You can of course repeat the second loop as many times as you want, choosing a new r each time, to generate multiple samples.
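If many samples are drawn from the same weights, a common variation (not what the code above does) is to build the running totals once and binary-search them per sample, which turns each draw from O(n) into O(log n). A sketch, with hypothetical helper names build_cumulative and pick_index, where r is expected to be uniform in [0, total):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Running totals: cumulative[i] = weights[0] + ... + weights[i].
std::vector<double> build_cumulative(const std::vector<double>& weights) {
    std::vector<double> cumulative;
    cumulative.reserve(weights.size());
    double total = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        assert(weights[i] >= 0.0);
        total += weights[i];
        cumulative.push_back(total);
    }
    return cumulative;
}

// Map r in [0, total) to the first index whose running total exceeds r.
std::size_t pick_index(const std::vector<double>& cumulative, double r) {
    assert(!cumulative.empty());
    std::vector<double>::const_iterator it =
        std::upper_bound(cumulative.begin(), cumulative.end(), r);
    if (it == cumulative.end()) {
        --it;  // guard against r landing on or past the total through rounding
    }
    return static_cast<std::size_t>(it - cumulative.begin());
}
```

Here r would be generated as in select above, using cumulative.back() as the total. For selection inversely proportional to trust, the same code applies after transforming the weights first.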