This question already has answers here: How to efficiently calculate a running standard deviation.
This may be a very open-ended question.
I have to quickly measure the time taken by a section of code. I'm using std::chrono::high_resolution_clock. I have to run this code for many iterations and measure the duration of each run.
So here is the problem: I can track the minimum and maximum duration values and calculate the average from the sample count, so I only need to store 4 values. But I would also like to know how the data is distributed. Calculation of the standard deviation or histogram requires that all data points be stored. However, this will require either one giant preallocated data structure or a dynamically growing one - both of which will affect the measured code on my embedded system.
Is there a way to calculate the standard deviation of the current sample using the standard deviation of the previous samples?
Calculation of the standard deviation or histogram requires that all data points be stored
That's trivially false. You can calculate a running standard deviation with Welford's algorithm, which just requires one extra variable besides the running mean and the current count of elements.
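For instance, a minimal sketch of such a running-statistics accumulator in C++ (this is the standard Welford update; the struct and field names are mine):

```cpp
// Minimal sketch of Welford's online algorithm: per sample it keeps only the
// count, the running mean, and the sum of squared deviations (M2).
#include <cmath>
#include <cstdint>

struct RunningStats {
    std::uint64_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // sum of squared differences from the current mean

    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // uses the updated mean
    }

    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }  // sample variance
    double stddev()   const { return std::sqrt(variance()); }
};
```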
As for the histogram, you don't need to keep all the data - you just need to keep a count for each bin and increment the right bin each time you get a new sample. Of course, for this simple approach to pay off you need to know the expected range and number of bins in advance. If that isn't possible, you can start with small bins over a small range and scale the bin size (merging adjacent bins) whenever you encounter an element outside the current range. Again, all of this requires only a fixed amount of memory (one integer per bin plus two values for the range).
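A fixed-memory histogram along those lines might look like the following sketch (the bin count and expected range are assumptions you would tune; out-of-range values are simply clamped into the edge bins here rather than triggering the rescaling described above):

```cpp
// Sketch of a fixed-size histogram: constant memory, one increment per sample.
#include <array>
#include <cstddef>
#include <cstdint>

template <std::size_t Bins>
struct Histogram {
    double lo, hi;                           // expected value range
    std::array<std::uint64_t, Bins> counts{};

    void add(double x) {
        const double f = (x - lo) / (hi - lo);
        const std::size_t bin = f <= 0.0 ? 0
                              : f >= 1.0 ? Bins - 1
                              : static_cast<std::size_t>(f * Bins);
        ++counts[bin];
    }
};
```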
Let's say I have 300-400 units of work, all with different sizes, where the size difference is quite large in some cases. Is it possible to split those up into a fixed number of buckets so I can balance the load across a fixed number of worker threads?
The problem you describe is known as the Multiprocessor Scheduling problem (closely related to the bin packing and knapsack problems). Finding an optimal schedule is NP-hard, so no polynomial-time algorithm is known for it.
A simple (non-optimal) heuristic is Longest Processing Time (LPT); a sketch follows the two steps below:
Sort the units of work, largest first
For each unit, place it into the bucket with the earliest end time
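A rough sketch of LPT in C++ (the function name and the min-heap of bucket loads are my own choices; "end time" is modelled as the bucket's accumulated load):

```cpp
// Sketch of the Longest Processing Time heuristic: sort work items descending,
// then always drop the next item into the currently least-loaded bucket.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<std::vector<double>> lptSchedule(std::vector<double> work, std::size_t buckets)
{
    std::sort(work.begin(), work.end(), std::greater<double>());

    // Min-heap of (current load, bucket index).
    using Entry = std::pair<double, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> loads;
    for (std::size_t b = 0; b < buckets; ++b) loads.push({0.0, b});

    std::vector<std::vector<double>> assignment(buckets);
    for (double w : work) {
        auto [load, b] = loads.top();       // least-loaded bucket so far
        loads.pop();
        assignment[b].push_back(w);
        loads.push({load + w, b});
    }
    return assignment;
}
```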
This question already has answers here: MPI_Scatter - sending columns of 2D array.
My program needs to scatter a matrix between processes. The matrix is represented in memory by a 1D array. In my first version I scattered the matrix between processes by rows. The processes then send each other some rows of their local matrices so that each one can correctly perform its part of the computation; they exchange these rows with the sendrecv function. Up to this point everything works fine.
Now it has occurred to me that if the matrix has many more columns than rows, it would be better to scatter it by columns instead of rows, so that fewer elements of the local matrices have to be sent between processes, improving the scalability of the program. The thing is: how can I scatter the matrix by columns? And then, how can I select the proper columns for the processes to send to each other?
If possible, try changing your 1D array from row-major order to column-major order, scatter it, perform the computation, gather the results, and then change it back from column-major to row-major order. Depending on your matrix, the cost of the back-and-forth transformation might be greater than the savings obtained from parallelizing along the columns. See the boost::multi_array documentation (http://www.boost.org/doc/libs/1_55_0b1/libs/multi_array/doc/user.html#sec_storage)
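A hedged sketch of that idea (not your code; it assumes the root process holds the full matrix, that the number of columns divides evenly by the number of processes, and that a plain transpose is acceptable):

```cpp
// Sketch: transpose the row-major matrix to column-major on the root so that
// whole columns become contiguous, then scatter equal column blocks.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int rows = 4, cols = 8;            // example dimensions
    std::vector<double> colMajor;            // transposed matrix, filled on root
    if (rank == 0) {
        std::vector<double> rowMajor(rows * cols);
        // ... fill rowMajor ...
        colMajor.resize(rows * cols);
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                colMajor[c * rows + r] = rowMajor[r * cols + c];
    }

    // Each process receives (cols / size) whole columns, stored contiguously.
    const int localCols = cols / size;
    std::vector<double> localBlock(localCols * rows);
    MPI_Scatter(colMajor.data(), localCols * rows, MPI_DOUBLE,
                localBlock.data(), localCols * rows, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    // ... compute on localBlock, exchange columns with MPI_Sendrecv as before ...

    MPI_Finalize();
    return 0;
}
```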
Straight to the facts.
My neural network is a classic feedforward network trained with backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on the historical data.
This dataset is about 10 MB, so training it on one core takes ages. I want to go multicore with the training, but I can't understand what happens to the training data on each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes the forward and backward propagations' - does this mean each thread just trains itself with its part of the dataset? How many iterations of the training happen per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - what exactly does that mean? When the cores finish training with their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
With this in mind, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use some common parameters for the learning rate, etc.). You end up with $N$ trained networks $f_i(x)$.
Next, you need a scheme to combine the results. Choose $F(x) = \sum_{i=1}^{N} \alpha_i f_i(x)$, then use least squares to adapt the parameters $\alpha_i$ such that $\sum_{j=1}^{M} \big(F(x_j) - y_j\big)^2$ is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements $M$ and should therefore be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply average the weights, see below.
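As an illustration of the recombination step only, here is a sketch assuming Eigen (3.3+ for bdcSvd); the matrix of per-network outputs and the target vector are hypothetical names:

```cpp
// Hedged sketch: fit the mixing coefficients alpha by linear least squares,
// given the outputs of the N separately trained networks on M validation points.
#include <Eigen/Dense>

Eigen::VectorXd fitMixingWeights(const Eigen::MatrixXd& netOutputs,  // M x N, entries f_i(x_j)
                                 const Eigen::VectorXd& targets)     // M,     entries y_j
{
    // Minimize || netOutputs * alpha - targets ||^2 via an SVD-based solver.
    return netOutputs.bdcSvd(Eigen::ComputeThinU | Eigen::ComputeThinV)
                     .solve(targets);
}
```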
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number of your choosing. I would play around with values roughly between 1 and 20 here. Note that the suggestion above uses infinity, so to speak, but then replaces the recombination step with something more appropriate.
This step simply does what it says: it sums up all weights and deltas (exactly what depends on your algorithm). Remember, what you are aiming for is a single trained network in the end, and the split data is used to estimate it.
To collect, often one does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
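A sketch of one such (i)/(ii) step, using OpenMP for the thread-parallel part; Dataset and trainOneEpoch() are hypothetical placeholders, not part of any particular library:

```cpp
#include <cstddef>
#include <vector>

using Weights = std::vector<double>;

struct Dataset { /* one thread's shard of the training samples (placeholder) */ };

// Hypothetical: runs one (or a few) backpropagation passes over a shard,
// starting from the given weights, and returns the updated weights.
Weights trainOneEpoch(const Weights& start, const Dataset& shard);

Weights averagedTrainingStep(const Weights& globalWeights,
                             const std::vector<Dataset>& shards)   // one shard per thread
{
    const int n = static_cast<int>(shards.size());
    std::vector<Weights> local(n, globalWeights);

    // Step (i): every thread backpropagates on its own shard, starting from the
    // same global weights.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        local[i] = trainOneEpoch(globalWeights, shards[i]);

    // Step (ii): average the thread-local weights into the new global weights.
    Weights next(globalWeights.size(), 0.0);
    for (int i = 0; i < n; ++i)
        for (std::size_t w = 0; w < next.size(); ++w)
            next[w] += local[i][w] / n;
    return next;   // feed back into step (i) until convergence
}
```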
This is a form of iterative optimization. Variations of this algorithm:
Instead of always using the same split, use random splits at each iteration step (... or at every n-th iteration). Or, in the spirit of random forests, use only a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
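For illustration, here is a sketch of one dense layer forward-propagating a whole mini-batch as a single matrix product (assuming Eigen; the tanh activation and the layer shapes are just examples):

```cpp
// Hedged sketch: forward-propagate a mini-batch as one matrix operation, so the
// linear algebra library can parallelize it internally.
#include <Eigen/Dense>
#include <cmath>

// One dense layer: Y = activation(X * W + b), applied row-wise to the batch.
Eigen::MatrixXd forwardLayer(const Eigen::MatrixXd& X,   // batch_size x n_in
                             const Eigen::MatrixXd& W,   // n_in x n_out
                             const Eigen::VectorXd& b)   // n_out
{
    Eigen::MatrixXd Z = (X * W).rowwise() + b.transpose();
    return Z.unaryExpr([](double z) { return std::tanh(z); });  // tanh activation
}
```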
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.
I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (Every sample is taken with a common frequency)
Irregular discretisation (Samples are taken at arbitrary time-points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data specific computation along with the time-series query, so that the computation can occur as close to the data as is possible - so as to reduce node to node communications (somewhat similar to map/reduce paradigm) - in short proximity of computation and data is very critical.
Another issue the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%); however, new data is added on a daily basis - think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than can reasonably be cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and it has no native parallel-processing capabilities from the front end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU into these new "cache oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB, they've got a number of good presentations like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that kd trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. What you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the value constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1].
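A sketch of that approach on a single node, assuming the samples are stored sorted by time (the Sample layout is mine):

```cpp
// Hedged sketch: pull the time interval [t0, t1] from a time-sorted series with
// binary search, then filter on the value range [v0, v1] pointwise.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Sample { int64_t t; double v; };   // assumed layout; times sorted ascending

std::vector<Sample> rangeQuery(const std::vector<Sample>& series,
                               int64_t t0, int64_t t1, double v0, double v1)
{
    auto lo = std::lower_bound(series.begin(), series.end(), t0,
                               [](const Sample& s, int64_t t) { return s.t < t; });
    auto hi = std::upper_bound(series.begin(), series.end(), t1,
                               [](int64_t t, const Sample& s) { return t < s.t; });
    std::vector<Sample> out;
    for (auto it = lo; it != hi; ++it)
        if (it->v >= v0 && it->v <= v1)   // post-process value constraint
            out.push_back(*it);
    return out;
}
```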
General ideas:
Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: B-tree family).
Problems 2 / 3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min / max for each range. Using that information, you can filter out time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, the interval can be skipped). The rest needs to be searched by going through the data. The effectiveness greatly depends on how fast the data is changing.
You can also build multiple index levels by combining the time ranges of lower levels to make the filtering faster.
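A sketch of such a per-range min/max filter (the BlockSummary layout is an assumption, not an existing API):

```cpp
// Hedged sketch: a coarse per-block min/max index; blocks whose [min, max]
// cannot intersect the value range are skipped without touching the raw data.
#include <cstddef>
#include <vector>

struct BlockSummary { double minV, maxV; std::size_t firstSample, sampleCount; };

std::vector<std::size_t> candidateBlocks(const std::vector<BlockSummary>& index,
                                         double v0, double v1)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < index.size(); ++i)
        if (index[i].maxV >= v0 && index[i].minV <= v1)  // may contain matches
            hits.push_back(i);
    return hits;   // only these blocks need to be scanned in full
}
```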
It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and will let you run complicated map-reduce functions in the future.
To learn how to store time series in cassandra please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.
I'm bulk loading data into a django model, and have noticed that the number of objects loaded into memory before doing a commit affects the average time to save each object. I realise this can be due to many different factors, so would rather focus on optimizing this STEPSIZE variable.
What would be a simple algorithm for optimizing this variable, in realtime, while taking into account the fact that this optimum might also change during the process?
I imagine this would be some sort of gradient descent, with a bit of jitter to look for changes in the landscape? Is there a formally defined algorithm for this type of search?
I'd start out assuming that
1) Your function increases monotonically in both directions away from the optimum
2) You roughly know the size of the region in which the optimum will live.
Then I'd recommend a bracket and subdivide approach as follows:
Evaluate your function outwards from the previous optimum in both directions. Stop the search in each direction when a value higher than the previous optimum is reached. With the assumptions above, this gives you a bracketed interval in which the new optimum lives. Break this region into left and right halves by evaluating its midpoint, keep the side with the lower values, and repeat recursively until the region is small enough for your liking.
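A sketch of that bracket-and-subdivide search, assuming a unimodal cost; evalCost is a hypothetical stand-in for timing one bulk-load run at a given STEPSIZE, and the subdivision is written as a ternary-style split (two interior probes) so the loop is guaranteed to make progress:

```cpp
#include <cstddef>
#include <functional>

std::size_t bracketAndSubdivide(const std::function<double(std::size_t)>& evalCost,
                                std::size_t previousOptimum,
                                std::size_t minStep, std::size_t maxStep,
                                std::size_t tolerance)
{
    const double best = evalCost(previousOptimum);

    // Bracket: walk outwards (doubling the offset) until the cost rises above
    // the previous optimum on each side, or we hit the edge of the range.
    std::size_t lo = previousOptimum, hi = previousOptimum;
    for (std::size_t step = 1; lo > minStep && evalCost(lo) <= best; step *= 2)
        lo = (lo > minStep + step) ? lo - step : minStep;
    for (std::size_t step = 1; hi < maxStep && evalCost(hi) <= best; step *= 2)
        hi = (hi + step < maxStep) ? hi + step : maxStep;

    // Subdivide: probe two interior points and keep the half with the lower cost.
    while (hi - lo > tolerance) {
        const std::size_t third = (hi - lo) / 3;
        if (third == 0) break;               // interval too small to split further
        const std::size_t m1 = lo + third, m2 = hi - third;
        if (evalCost(m1) < evalCost(m2)) hi = m2; else lo = m1;
    }
    return (lo + hi) / 2;
}
```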