Does anyone know of an algorithm or data structure relating to selecting items, with a probability of them being selected proportional to some attached value? In other words: http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Probability_proportional_to_size_sampling
The context here is a decentralized reputation system and the attached value is therefore the value of trust one user has in another. In this system all nodes either start as friends which are completely trusted or unknowns which are completely untrusted. This isn't useful by itself in a large P2P network because there will be many more nodes than you have friends and you need to know who to trust in the large group of users that aren't your direct friends, so I've implemented a dynamic trust system in which unknowns can gain trust via friend-of-a-friend relationships.
Every so often each user will select a fixed number (for the sake of speed and bandwidth) of target nodes to recalculate their trust based on how much another selected fixed number of intermediate nodes trust them. The probability of selecting a target node for recalculation will be inversely proportional to its current trust so that unknowns have a good chance of becoming better known. The intermediate nodes will be selected in the same way, except that the probability of selection of an intermediary is proportional to its current trust.
I've written up a simple solution myself but it is rather slow and I'd like to find a C++ library to handle this aspect for me. I have of course done my own search and I managed to find TRSL which I'm digging through right now. Since it seems like a fairly simple and perhaps common problem, I would expect there to be many more C++ libraries I could use for this, so I'm asking this question in the hope that someone here can shed some light on this.
This is what I'd do:
#include <cstdlib>

int select(const double *weights, int n) {
    // This step is only necessary if the weights can be arbitrary
    // (we know total = 1.0 for probabilities)
    double total = 0;
    for (int i = 0; i < n; ++i) {
        total += weights[i];
    }
    // Convert RAND_MAX to double before adding 1 to avoid integer overflow
    double r = (double) rand() * total / ((double) RAND_MAX + 1);
    total = 0;
    for (int i = 0; i < n; ++i) {
        // Expected to fire before loop exit, since 0 <= r < sum of weights
        if (total <= r && total + weights[i] > r) {
            return i;
        }
        total += weights[i];
    }
    // Only reachable through floating-point rounding (or n == 0)
    return n - 1;
}
You can of course repeat the second loop as many times as you want, choosing a new r each time, to generate multiple samples.
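When many samples are drawn from the same weights, the linear scan can be swapped for a binary search over running totals. The following is a minimal sketch of that variation, not the code above: `WeightedSampler` and `draw` are hypothetical names, and it assumes a C++11 compiler with `<random>` and non-negative weights with a positive sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: builds the running totals once, so each draw costs
// O(log n) instead of O(n).
class WeightedSampler {
public:
    explicit WeightedSampler(const std::vector<double>& weights) {
        assert(!weights.empty());
        double total = 0.0;
        cumulative_.reserve(weights.size());
        for (std::size_t i = 0; i < weights.size(); ++i) {
            assert(weights[i] >= 0.0);  // assumption: no negative weights
            total += weights[i];
            cumulative_.push_back(total);
        }
        assert(total > 0.0);
    }

    // Returns index i with probability weights[i] / sum(weights).
    template <class Rng>
    std::size_t draw(Rng& rng) const {
        std::uniform_real_distribution<double> dist(0.0, cumulative_.back());
        const double r = dist(rng);
        // First index whose running total is strictly greater than r.
        std::vector<double>::const_iterator it =
            std::upper_bound(cumulative_.begin(), cumulative_.end(), r);
        if (it == cumulative_.end()) {
            // Rounding guard: fall back to the last item with non-zero weight.
            it = std::lower_bound(cumulative_.begin(), cumulative_.end(),
                                  cumulative_.back());
        }
        return static_cast<std::size_t>(it - cumulative_.begin());
    }

private:
    std::vector<double> cumulative_;
};
```

Construct one sampler per weight set and call `draw` with any standard engine such as `std::mt19937`. If the weights change often, the running totals have to be rebuilt, which may cancel the gain.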
I have an application (Qt, but that is not really important) which is downloading several files, and I want to notify the user about the progress. The C++ app runs on a different machine and progress reports are sent over the network (the protocol does not matter here). I do not want to send a message over the network for each chunk of data received, but only at defined intervals, e.g. every 5% (so 0%, 5%, 10%).
Basically I have it like this right now:
void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
    m_files_downloaded_size += downloaded_bytes;
    int perc_download = (int) ((m_files_downloaded_size / m_files_total_size)*100);
    if(m_percentage_buffer > LocalConfig::getDownloadReportSteps() || m_files_downloaded_size == m_files_total_size){
        emit sigDownloadProgress(DOWNLOAD_PROGRESS, perc_download);
        m_percentage_buffer = 0;
    }else{
        m_percentage_buffer += (downloaded_bytes / m_files_total_size) * 100;
    }
}
Which means that each time received data triggers this slot I need to perform:
a greater-than comparison, an addition, a division and a multiplication
I know that I could at least skip the multiplication by storing a float in the settings and comparing to that. Other than that, are there any ways to make this more performant, or did I do well on my first attempt?
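For illustration, here is one way the "report only every N percent" rule could be expressed with integer arithmetic alone. This is a sketch under assumptions, not the code from the question: `ProgressThrottle` and `add` are hypothetical names, byte counts are assumed to fit in `long long`, and the step size is assumed to divide 100 evenly.

```cpp
#include <cassert>

// Hypothetical helper: tracks the byte count and reports a percentage only
// when a new step boundary (0%, 5%, 10%, ...) has been crossed.
class ProgressThrottle {
public:
    ProgressThrottle(long long totalBytes, int stepPercent)
        : total_(totalBytes), step_(stepPercent), downloaded_(0), lastStep_(-1) {
        assert(totalBytes > 0);
        assert(stepPercent > 0 && 100 % stepPercent == 0);
    }

    // Adds newly received bytes. Returns the percentage to report, or -1 if
    // no step boundary was crossed since the last report.
    int add(long long bytes) {
        downloaded_ += bytes;
        const int percent = static_cast<int>(downloaded_ * 100 / total_);
        const int step = percent / step_;
        if (step <= lastStep_) {
            return -1;
        }
        lastStep_ = step;
        return step * step_;
    }

private:
    long long total_;
    int step_;
    long long downloaded_;
    int lastStep_;
};
```

In the slot, the signal would be emitted only when `add` returns a non-negative value. Whether this is measurably faster than the floating-point version would need profiling; the per-call cost is small compared to a network send in either case.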
I have a scheduler, endlessly executing n actions. Each action is scheduled for x seconds into the future. When an action completes, it is re-scheduled for another x seconds into the future after its previously scheduled time. Every 1s, the scheduler "ticks", executing at most 25 actions which are due to fire. Actions may take a second or so to complete (though this value should be considered variable and unpredictable).
Say that x is 60 seconds. Due to the throttling of at most 25 actions being executed simultaneously, when n grows large, it is conceivable that the scheduler won't have time to execute all n actions within a 60 second window, and actions will be executed later and later as time goes on. This is undesirable, as it'll become true that there are actions to execute on every single tick and this increases load on my system. It's less important to me to keep x exactly constant than it is to keep load down.
So I wish to implement an adaptive "handicap", an automatically-applied fudge factor h, increasing it when a majority of actions are executed "late", and decreasing it (edging it back to its default of zero) when they're all seemingly and consistently on time. The scheduler would then be made to schedule actions for x+h seconds' time, rather than x.
At a high level, how would you approach this? How would you define "a majority of actions are executed 'late'" and how would you represent/detect it in C++03 code?
Better yet, is there an existing well-known approach that objectively "works" here?
To be clear, you are aiming to avoid sustained high load where there are tasks
every tick, rather than aiming to minimise the scheduling delay.
Correspondingly, the metric you should be looking at when considering the fudge
factor is the load, not the lateness.
If you have full knowledge of the system — the number of tasks, their
rescheduling intervals, the distribution of their execution time —
you could in principle exactly solve for a handicap value that would give you
a mean target load when busy, or would, say, only exceed the target load
10% of the time when busy, or so on.
On the other hand, if this information is not available or predictable,
you will need an adaptive approach.
The general theory for this sort of thing is control theory, which can get
quite involved. Broadly though the heuristic is: if the load is less than the
threshold, and we have a positive handicap, reduce the handicap; if the load is
over the threshold, increase the handicap.
The handicap should be multiplicative, rather than additive: if, for example,
we knew we were consistently 10% overloaded, then we'd be right on target if we
applied a proportional delay of 10% to the scheduling of jobs. That is, we're
looking to apply a handicap factor h such that jobs are scheduled x*h
seconds ahead instead of x. A factor of 1 would correspond to no handicap.
When we're overloaded, but not maximally overloaded, the response then is linear
in the log: log(h) = log(load) - log(load_target). So the simplest method
would be:
load = get_current_load();
if (load>load_target) h = load/load_target;
else h = 1.0;
Unfortunately, there is a maximum measured load, and linearity breaks down
here. The linear model can be extended to incorporate the accumulated
deviation from the target load, and the rate of change of the load.
This corresponds to the proportional-integral-derivative controller.
As this is a noisy environment (there is variation in the action
execution times), it might be wise to shy away from the derivative bit
of this model, and stick with the proportional-integral (PI) part.
When this model is discretized, we get an expression for log(h)
that is proportional to the current (log) overload, plus a term that
captures how badly we've been doing:
load = get_current_load();
deviation = load > load_target ? log(load/load_target) : 0;
accum += p1 * deviation;
log_h = p2 * deviation + accum;
h = log_h < 0 ? 1.0 : exp(log_h);
Except that the problem isn't symmetric: in the fragment above, when we're
below the load target the accumulated error term never decreases, so it stays high.
We could work around it by accumulating negative deviations
as well, but clamping the accumulated error so that it never
drops below zero, so that a period of legitimately low load
doesn't give us a free pass for later:
load = get_current_load();
if (load > 0) {
    deviation = log(load/load_target);
    accum += p1 * deviation;
    if (accum < 0) accum = 0;
    if (deviation < 0) deviation = 0;
}
else {
    accum = 0;
    deviation = 0;
}
log_h = p2 * deviation + accum;
h = log_h < 0 ? 1.0 : exp(log_h);
The value for p2 will be somewhere (roughly) between 0.5 and 0.9,
to leave some room for the influence of the accumulated error.
A good value for p1 will probably be around 0.3 to 0.5 times
the reciprocal of the lag time, the number of steps it takes for a change
in h to present itself as a change in load. This can be estimated
by the mean rescheduling time of the actions.
You can play around with these parameters to get the sort of
response you'd like, or you can make a more faithful mathematical
model of your scheduling problem and then do maths to it!
The parameters themselves can also be modified adaptively over
time, based on the observed response to changes in load.
(Warning, I haven't actually tried these fragments in a mock scheduler!)
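For reference, the last fragment can be packaged so that the state (`accum`) persists between ticks. This is only a sketch, equally untested against a real scheduler: the `HandicapController` name and its interface are hypothetical, and the caller is assumed to measure the load itself and pass it in each tick.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical wrapper around the PI-style update; compiles as C++03.
struct HandicapController {
    double load_target;  // desired load, in the same units as the measurement
    double p1;           // gain applied to the accumulated (integral) term
    double p2;           // gain applied to the current (proportional) term
    double accum;        // accumulated log-overload, clamped at zero

    HandicapController(double target, double p1_, double p2_)
        : load_target(target), p1(p1_), p2(p2_), accum(0.0) {
        assert(target > 0.0);
    }

    // Feed one load measurement; returns the handicap factor h >= 1.
    double update(double load) {
        double deviation = 0.0;
        if (load > 0.0) {
            deviation = std::log(load / load_target);
            accum += p1 * deviation;
            if (accum < 0.0) accum = 0.0;          // no credit for quiet periods
            if (deviation < 0.0) deviation = 0.0;  // only overload counts directly
        } else {
            accum = 0.0;
        }
        const double log_h = p2 * deviation + accum;
        return log_h <= 0.0 ? 1.0 : std::exp(log_h);
    }
};
```

Each tick the scheduler would call `update(load)` and then reschedule actions `x * h` seconds ahead. The gains still need tuning as described above.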
This is a follow-up to Fast percentile in C++
I have a sorted array of 365 daily cashflows (xDailyCashflowsDistro) which I randomly sample 365 times to get a generated yearly cashflow. Generating is carried out by
1/ picking a random probability in the [0,1] interval
2/ converting this probability to an index in the [0,364] interval
3/ determining what daily cashflow corresponds to this probability by using the index and some linear approximation.
and summing 365 generated daily cashflows. Following the previously mentioned thread, my code precalculates the differences of sorted daily cashflows (xDailyCashflowDiffs) where
xDailyCashflowDiffs[i] = xDailyCashflowsDistro[i+1] - xDailyCashflowsDistro[i]
and thus the whole code looks like
double _dIdxConverter = ((double)(365 - 1)) / (double)(RAND_MAX - 1);
for ( unsigned int xIdx = 0; xIdx < _xCount; xIdx++ )
{
    double generatedVal = 0.0;
    for ( unsigned int xDayIdx = 0; xDayIdx < 365; xDayIdx ++ )
    {
        double dIdx = (double)fastRand()* _dIdxConverter;
        long iIdx1 = (unsigned long)dIdx;
        double dFloor = (double)iIdx1;
        generatedVal += xDailyCashflowsDistro[iIdx1] + xDailyCashflowDiffs[iIdx1] *(dIdx - dFloor);
    }
    results.push_back(generatedVal) ;
}
_xCount (the number of simulations) is 1K+, usually 10K.
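The lookup with precomputed differences described above can be isolated as follows. This is a sketch for illustration only: `QuantileTable` and `at` are hypothetical names, and padding the difference array with a trailing zero is an assumption made here so that the top index never reads past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring xDailyCashflowsDistro / xDailyCashflowDiffs.
struct QuantileTable {
    std::vector<double> values;  // sorted daily cashflows
    std::vector<double> diffs;   // diffs[i] = values[i+1] - values[i]; last is 0

    explicit QuantileTable(const std::vector<double>& sorted)
        : values(sorted), diffs(sorted.size(), 0.0) {
        assert(!sorted.empty());
        for (std::size_t i = 0; i + 1 < sorted.size(); ++i) {
            diffs[i] = sorted[i + 1] - sorted[i];
        }
    }

    // Linear interpolation at probability p in [0, 1].
    double at(double p) const {
        assert(p >= 0.0 && p <= 1.0);
        const double pos = p * static_cast<double>(values.size() - 1);
        const std::size_t i = static_cast<std::size_t>(pos);  // floor, pos >= 0
        return values[i] + diffs[i] * (pos - static_cast<double>(i));
    }
};
```

With the padded entry, `at(1.0)` returns the largest value without a special case.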
The problem:
This simulation is being carried out 15M times (compared to 100K when the first thread was written) at the moment, and it takes ~10 minutes on a 3.4GHz machine. Due to the nature of problem, this 15M is unlikely to be significantly lowered in the future, only increased. Having used VTune Analyzer, I am being told that the last but one line (generatedVal += ...) generates 80% of runtime. And my question is why and how I can work with that.
Things I have tried:
1/ getting rid of the (dIdx - dFloor) part to see whether double difference and multiplication is the main culprit - runtime dropped by a couple of percent
2/ declaring xDailyCashflowsDistro and xDailyCashflowDiffs as __restrict so as to prevent the compiler from thinking they depend on each other - no change
3/ tried using 16 days (as opposed to 365) to see whether it is cache misses that drag my performance - not a slight change
4/ tried using floats as opposed to doubles - no change
5/ compiling with different /fp: - no change
6/ compiling as x64 - has effect on the double <-> ulong conversions, but the line in question is unaffected
What I am willing to sacrifice is resolution - I do not care whether the generatedVal is 100010.1 or 100020.0 at the end if the speed gain is substantial.
EDIT:
The daily/yearly cashflows are related to the whole portfolio. I could divide all daily cashflows by portfolio size and would thus (at 99.99% confidence level) ensure that daily cashflows/pflio_size will not fall outside the [-1000,+1000] interval. In this case, though, I would need precision to the hundredths.
Perhaps you could turn your piecewise linear function into a piecewise-linear "histogram" of its values. The number you're sampling appears to be the sum of 365 samples from that histogram. What you're doing is a not-particularly-fast way to sample from the sum of 365 samples from that histogram.
You might try computing a Fourier (or wavelet or similar) transform, keeping only the first few terms, raising it to the 365th power, and computing the inverse transform. You won't get a probability distribution in the end, but there shouldn't be "too much" mass below 0 or above 1 and the total mass shouldn't be "too different" from 1 with this technique. (I don't know what your data looks like; this technique may well be unworkable for good mathematical reasons.)
I have raw 16bit 48khz pcm data. I need to strip all data which is out of the range of human hearing.
For now I'm just doing a sum of all samples and then dividing by the sample count to calculate peak sound level, but I need to reduce false positives.
The measured level is high all the time, and speech and other sounds that I can hear raise it only a little, so I need to implement some filtering. I am not familiar with sound processing at all, so currently I am not using any filtering because I do not understand how to create it. My current code looks like this:
for(size_t i = 0; i < buffer.size(); i++)
level += abs(buffer[i]);
level /= buffer.size();
How can I implement this kind of filtering using C++?
Use a band pass filter.
A band-pass filter is a device that passes frequencies within a
certain range and rejects (attenuates) frequencies outside that range.
This sounds like exactly the sort of filter you are looking for.
I had a quick google search and found this thread that discusses implementation in C++.
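As a rough illustration of what such a filter can look like, here is a second-order (biquad) band-pass sketch. The coefficient formulas are the band-pass with constant 0 dB peak gain from Robert Bristow-Johnson's Audio EQ Cookbook; the `BandPass` class name, its interface, and any particular centre frequency or Q you pass in are assumptions made here, not taken from the linked thread.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical biquad band-pass (Audio EQ Cookbook, constant 0 dB peak gain).
class BandPass {
public:
    BandPass(double sampleRate, double centerHz, double q)
        : x1_(0.0), x2_(0.0), y1_(0.0), y2_(0.0) {
        assert(sampleRate > 0.0 && q > 0.0);
        assert(centerHz > 0.0 && centerHz < sampleRate / 2.0);
        const double kPi = 3.14159265358979323846;
        const double w0 = 2.0 * kPi * centerHz / sampleRate;
        const double alpha = std::sin(w0) / (2.0 * q);
        const double a0 = 1.0 + alpha;
        // Coefficients normalised by a0.
        b0_ = alpha / a0;
        b1_ = 0.0;
        b2_ = -alpha / a0;
        a1_ = -2.0 * std::cos(w0) / a0;
        a2_ = (1.0 - alpha) / a0;
    }

    // Filters one sample (direct form I) and returns the output sample.
    double process(double x) {
        const double y = b0_ * x + b1_ * x1_ + b2_ * x2_ - a1_ * y1_ - a2_ * y2_;
        x2_ = x1_;
        x1_ = x;
        y2_ = y1_;
        y1_ = y;
        return y;
    }

private:
    double b0_, b1_, b2_, a1_, a2_;
    double x1_, x2_, y1_, y2_;
};
```

Each 16-bit sample would be converted to `double`, passed through `process`, and the level computed from the filtered output. A lower Q gives a wider pass band; suitable values depend on which sounds should count.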
It sounds like you want to do something (maybe start recording) if the sound level goes above a certain threshold. This is sometimes called a "gate". It also sounds like you are having trouble with false positives. This is sometimes handled with a "side-chain" applied to the gate.
The general principle of a gate is to create an envelope of your signal, and then monitor the envelope to discover when it goes above a certain threshold. If it is above the threshold, your gate is "on"; if not, your gate is "off". If you treat your signal in some way before creating the envelope, to make it more or less sensitive to various parts of your signal/noise, the treatment is called a "side-chain".
You will have to discover the details on your own because there is too much for a Q&A website, but maybe this is enough of a start:
std::vector<float> buffer;   // defined and filled elsewhere (assumption: samples scaled to [-1, 1])
const float HOLD = 0.9999f;  // there are precise ways to compute this, but experimentation might work fine
const float THRESH = 0.7f;   // or whatever
float env = 0;               // we initialize to 0, but in real code be sure to save this between runs
for (size_t i = 0; i < buffer.size(); i++) {
    // side-chain, if used, goes here
    float b = buffer[i];
    // create envelope:
    float tmp = std::fabs(b); // needs <cmath>; you could also do b * b
    env = env * HOLD + tmp * (1 - HOLD);
    // threshold detection
    if (env > THRESH) {
        // gate is "on"
    } else {
        // gate is "off"
    }
}
The side-chain might consist of filters like an eq. Here is a tutorial on designing audio eq: http://blog.bjornroche.com/2012/08/basic-audio-eqs.html
I am using the spatialindex library from http://libspatialindex.github.com/
I am creating an R* tree in the main memory:
size_t capacity = 10;
bool bWriteThrough = false;
fileInMem = StorageManager
    ::createNewRandomEvictionsBuffer(*memStorage, capacity, bWriteThrough);
double fillFactor = 0.7;
size_t indexCapacity = 10;
size_t leafCapacity = 10;
size_t dimension = 2;
RTree::RTreeVariant rv = RTree::RV_RSTAR;
tree = RTree::createNewRTree(*fileInMem, fillFactor, indexCapacity,
                             leafCapacity, dimension, rv, indexIdentifier);
Then I am inserting a large number of bounding boxes, currently some 2.5M (road network of Bavaria in Germany). Later I'll aim at inserting all roads of Europe.
What is a good choice of parameters for the storage manager and rtree? Mostly I am using the rtree to find the closest roads to a given query (bbox intersection).
As your data is static, a good bulk load may work for you. The most popular (and rather simple) bulk load is Sort-Tile-Recursive. However, it is somewhat designed around point data. As you are inserting spatial objects, it may or may not work as well.
If you are using a bulk load, it will no longer be an R*-tree, but a plain R-tree.
Capacity 10 sounds way too small to me. You want a much larger fan-out. But you'll need to benchmark; what works well depends on the data set and the queries. I'd definitely try 100 or more.