Calculating Spearman's rank correlation very quickly - concurrency

Is there a way of calculating Spearman's rank correlation over a large set of data very quickly? We have several thousand such calculations to perform every month, and the total duration of these calculations takes far too long. Is it possible to perform Spearman's rank correlation using concurrent (threaded) operations, or to distribute the calculation across several cores (or even servers)?

For Java developers, Apache Commons Math (org.apache.commons.math3.stat.correlation.SpearmansCorrelation) now provides a class to calculate Spearman's correlation.
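Independent of the library used, this workload parallelises naturally, since each of the several thousand correlations is independent of the others. The following is a minimal illustrative C++ sketch (not the Apache Commons API; the function names, data layout and use of std::async are assumptions) that computes Spearman's rho as the Pearson correlation of ranks and distributes the individual pairs across cores:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Replace each value by its rank (proper handling of ties via average ranks is omitted for brevity).
std::vector<double> ranks(const std::vector<double>& v) {
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    std::vector<double> r(v.size());
    for (std::size_t i = 0; i < idx.size(); ++i) r[idx[i]] = static_cast<double>(i + 1);
    return r;
}

// Spearman's rho = Pearson correlation of the rank vectors.
double spearman(const std::vector<double>& x, const std::vector<double>& y) {
    const std::vector<double> rx = ranks(x), ry = ranks(y);
    const double n = static_cast<double>(rx.size());
    const double mx = std::accumulate(rx.begin(), rx.end(), 0.0) / n;
    const double my = std::accumulate(ry.begin(), ry.end(), 0.0) / n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < rx.size(); ++i) {
        sxy += (rx[i] - mx) * (ry[i] - my);
        sxx += (rx[i] - mx) * (rx[i] - mx);
        syy += (ry[i] - my) * (ry[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// Distribute many independent correlation jobs across cores. For thousands of jobs,
// a thread pool would be preferable to one std::async task per pair.
std::vector<double> spearmanAll(
    const std::vector<std::pair<std::vector<double>, std::vector<double>>>& jobs) {
    std::vector<std::future<double>> futures;
    for (const auto& job : jobs)
        futures.push_back(std::async(std::launch::async, spearman,
                                     std::cref(job.first), std::cref(job.second)));
    std::vector<double> results;
    for (auto& f : futures) results.push_back(f.get());
    return results;
}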

Related

Why do I keep getting different 'time to test' results for the same dataset, algorithm and test parameters in WEKA?

I need to observe the test/training times of running different algorithms. But when I run an ML algorithm such as Naive Bayes (NB), for example, it sometimes reports 1.7 s as the test time and at other times 2.3 s or 0.8 s (all parameters are kept the same).
Similarly, when comparing a dataset with 60 features against one with the same number of flows but only 20 features, the results sometimes show that the smaller dataset took longer.
I would be grateful for an explanation or advice please. Thank you
The time that Weka is displaying is the so-called wall time, which is merely the time that elapsed between starting the evaluation and finishing it. This does not represent the actual time (as in number of CPU cycles) that your machine spent performing the evaluation. Depending on other processes running on your machine and other Java threads competing for CPU time, this can easily vary.
The Experimenter, in contrast to command-line execution or evaluations in the Explorer, also generates time measures based on actual CPU cycles (UserCPU_Time_...).

How do I split many variable-sized units of work into equal sized buckets?

Let's say I have 300-400 units of work, all with different sizes, where the size difference is quite large in some cases. Is it possible to split those up into a fixed number of buckets so I can balance the load across a fixed number of worker threads?
The problem that you describe is known as the multiprocessor scheduling problem (similar to the bin packing problem, which is a generalisation of the knapsack problem). Finding an optimal schedule is known to be NP-hard, so there is no known polynomial-time algorithm for finding one.
A simple heuristic (non-optimal) algorithm is Longest Processing Time (LPT):
Sort the units of work, largest first
For each unit, place it into the bucket with the earliest end time (i.e., the currently least-loaded bucket); a minimal sketch follows below
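A minimal C++ sketch of LPT, assuming work sizes are plain numbers and using a min-heap of (current load, bucket index) pairs; all names are illustrative:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// LPT: sort the work descending, then greedily assign each unit to the
// currently least-loaded bucket. Returns the bucket index chosen for each work item.
std::vector<std::size_t> lptAssign(const std::vector<double>& work, std::size_t numBuckets) {
    // Sort indices by size, largest first, so results map back to the original items.
    std::vector<std::size_t> order(work.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return work[a] > work[b]; });

    // Min-heap keyed on each bucket's current total load ("end time").
    using Entry = std::pair<double, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t b = 0; b < numBuckets; ++b) heap.push({0.0, b});

    std::vector<std::size_t> assignment(work.size());
    for (std::size_t i : order) {
        Entry lightest = heap.top();   // bucket with the earliest end time
        heap.pop();
        assignment[i] = lightest.second;
        lightest.first += work[i];     // extend that bucket's end time
        heap.push(lightest);
    }
    return assignment;
}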

Statistical sampling of code [duplicate]

This question already has answers here:
How to efficiently calculate a running standard deviation
This may be a very open-ended question.
I have to quickly measure the time of some section of code. I'm using std::chrono::high_resolution_clock. I have to run this code for many iterations and measure the duration.
So here is the problem: I can measure the minimum and maximum duration values, and calculate the average using the sample count. In this case, I only need to store 4 values. But I would also like to know how the data is distributed. Calculating the standard deviation or a histogram seems to require that all data points be stored. However, this would require either one giant initial data structure or a dynamically growing data structure - both of which will affect the measured code on my embedded system.
Is there a way to calculate standard deviation for this sample using the standard deviation of the previous sample?
Calculation of the standard deviation or histogram requires that all data points be stored
That's trivially false. You can calculate a running standard deviation with Welford's algorithm, which just requires one extra variable besides the running mean and the current count of elements.
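A minimal C++ sketch of Welford's update (names are illustrative); only the count, the running mean, and the running sum of squared deviations are stored:

#include <cmath>
#include <cstdint>

// Running mean / standard deviation via Welford's algorithm: constant memory,
// one update per sample, no need to store the samples themselves.
struct RunningStats {
    std::uint64_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the current mean

    void add(double x) {
        ++count;
        const double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);  // uses the already-updated mean
    }

    double variance() const {
        return count > 1 ? m2 / static_cast<double>(count - 1) : 0.0;  // sample variance
    }
    double stddev() const { return std::sqrt(variance()); }
};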
As for the histogram, you don't need to keep all the data - you just need to keep the counts for each bin and increment the right bin each time you have a new sample. Of course, for this simple approach to pay off you need to know the expected range and number of bins in advance. If that isn't possible, you can always start with small bins over a small range and scale the bin size (merging adjacent bins) whenever you meet an element outside of the current range. Again, all this requires just a fixed amount of memory (one integer for each bin and two values for the range).
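And a minimal fixed-memory histogram for the simple case where the range and bin count are known up front (the bin-merging variant described above is omitted); names are illustrative:

#include <cstddef>
#include <vector>

// Fixed-range histogram: one counter per bin, constant memory, O(1) per sample.
class Histogram {
public:
    Histogram(double lo, double hi, std::size_t bins)
        : lo_(lo), hi_(hi), counts_(bins, 0) {}

    void add(double x) {
        if (x < lo_ || x >= hi_) return;  // out of range; a real version might rescale or merge bins
        std::size_t bin = static_cast<std::size_t>((x - lo_) / (hi_ - lo_) * counts_.size());
        if (bin >= counts_.size()) bin = counts_.size() - 1;  // guard against rounding at the upper edge
        ++counts_[bin];
    }

    const std::vector<std::size_t>& counts() const { return counts_; }

private:
    double lo_, hi_;
    std::vector<std::size_t> counts_;
};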

Neural Networks training on multiple cores

Straight to the facts.
My Neural network is a classic feedforward backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on the historical data.
The dataset is about 10 MB, so training it on one core takes ages. I want to go multicore with the training, but I can't understand what happens with the training data on each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes the forward and backward propagations' - this means each thread just trains itself with its part of the dataset, right? How many iterations of the training per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - what exactly does that mean? When the cores finish training with their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not what one is really looking for, the reason being overfitting. In order to obtain better generalization performance, approaches such as weight decay or early stopping are commonly used.
On this background, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use some common parameters for the learning rate, etc.). You end up with N trained networks f_i(x).
Next, you need a scheme to combine the results. Choose F(x) = Σ_{i=1}^{N} α_i f_i(x), then use least squares to adapt the parameters α_i such that Σ_{j=1}^{M} (F(x_j) - y_j)² is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number chosen by you; I would play around with choices roughly between 1 and 20 here. Note that the above suggestion uses infinity, so to speak, but then replaces the recombination step with something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you are aiming for is a single trained network in the end, and the split data is only used to estimate it.
To put it all together, one often does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
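A skeletal C++ sketch of this (i)/(ii) loop, treating the network weights as a flat vector; computeLocalWeights is a hypothetical placeholder for the per-shard backpropagation step:

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for real backpropagation on one data shard: it should start
// from the global weights and return locally updated weights. The stub below does
// no real training and exists only so the sketch compiles.
std::vector<double> computeLocalWeights(const std::vector<double>& globalWeights,
                                        std::size_t /*shardIndex*/) {
    return globalWeights;
}

std::vector<double> trainAveraged(std::vector<double> globalWeights,
                                  std::size_t numThreads,
                                  std::size_t numRounds) {
    for (std::size_t round = 0; round < numRounds; ++round) {
        std::vector<std::vector<double>> localWeights(numThreads);
        std::vector<std::thread> workers;

        // (i) each thread trains on its own shard, starting from the current global weights
        for (std::size_t t = 0; t < numThreads; ++t)
            workers.emplace_back([&, t] { localWeights[t] = computeLocalWeights(globalWeights, t); });
        for (auto& w : workers) w.join();

        // (ii) average the thread-local weights to obtain the new global weights
        for (std::size_t i = 0; i < globalWeights.size(); ++i) {
            double sum = 0.0;
            for (std::size_t t = 0; t < numThreads; ++t) sum += localWeights[t][i];
            globalWeights[i] = sum / static_cast<double>(numThreads);
        }
        // a fixed number of rounds stands in for a real convergence check
    }
    return globalWeights;
}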
This is a form of iterative optimization. Variations of this algorithm:
Instead of always using the same split, use random splits at each iteration step (... or at every n-th iteration). Or, in the spirit of random forests, use only a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
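For illustration, here is a minimal forward pass for a batch of samples written as Eigen matrix products (one hidden layer, tanh activation; layer sizes and names are assumptions). The two matrix products are where a parallel build of Eigen can use multiple cores:

#include <cmath>
#include <Eigen/Dense>

// Forward pass for one hidden layer; each column of X is one sample.
// H = tanh(W1 * X + b1), output = W2 * H + b2
Eigen::MatrixXd forwardPass(const Eigen::MatrixXd& W1, const Eigen::VectorXd& b1,
                            const Eigen::MatrixXd& W2, const Eigen::VectorXd& b2,
                            const Eigen::MatrixXd& X) {
    Eigen::MatrixXd H = ((W1 * X).colwise() + b1)
                            .unaryExpr([](double v) { return std::tanh(v); });
    return (W2 * H).colwise() + b2;
}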
Today, leading-edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA), which is orders of magnitude faster.

How to calculate Gflops of a kernel

I want a measure of how much of the peak performance my kernel achieves.
Say I have a NVIDIA Tesla C1060, which has a peak GFLOPS of 622.08 (~= 240Cores * 1300MHz * 2).
Now in my kernel I counted 16000 FLOPs for each thread (4000 x (2 subtractions, 1 multiplication and 1 sqrt)). So with 1,000,000 threads I come up with 16 GFLOP. And as the kernel takes 0.1 seconds I would achieve 160 GFLOPS, which would be about a quarter of the peak performance. Now my questions:
Is this approach correct?
What about comparisons (if(a>b) then....)? Do I have to consider them as well?
Can I use the CUDA profiler for easier and more accurate results? I tried the instruction counter, but I could not figure out what the figure means.
sister question: How to calculate the achieved bandwidth of a CUDA kernel
First some general remarks:
In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.
The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPS and can be retired at a maximum rate of one per cycle. Other floating point operations which retire at a maximum rate of one per cycle would formally only be classified as a single FLOP, while others might require many cycles to be retired. So if you decided to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.
The second point is that when researchers quote FLOP/s values for a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound of the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
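For example, for the matrixMul sample profiled further down this page (MatrixA 320x320, MatrixB 640x320), the usual model count for the product of an MxK matrix with a KxN matrix is 2*M*K*N floating-point operations:

2 x 320 x 320 x 640 = 131,072,000

which matches both the Ops figure printed by the sample and the flops_sp value reported by nvprof below.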
Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest:
Does version A of the code run faster than version B on the same hardware?
Does hardware A perform better than hardware B doing the task of interest?
In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s, it is useful operations per unit time (records per second in sorting, cells per second in a fluid mechanical simulation, etc). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.
If your interest is really about optimization and understanding the performance of your code, then maybe this presentation by Paulius Micikevicius from NVIDIA might be of interest.
Addressing the bullet point questions:
Is this approach correct?
Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count of the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, depending on its implementation and the characteristics of the number it is operating on, for example. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble the compiled code and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of values the code will compute.
What about comparisons (if(a>b) then....)? Do I have to consider them as well?
They are not floating point multiply-add operations, so no.
Can I use the CUDA profiler for easier and more accurate results? I tried the instruction counter, but I could not figure out what the figure means.
Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) deriving a FLOP count for a piece of code via the profiler is not possible. [EDIT: see Greg's excellent answer below for a discussion of the FLOP counting facilities available in versions of the profiling tools released since this answer was written]
Nsight VSE (>3.2) and the Visual Profiler (>=5.5) support Achieved FLOPs calculation. In order to collect the metric, the profilers run the kernel twice (using kernel replay). In the first replay the number of floating point instructions executed is collected (taking predication and the active mask into account); in the second replay the duration is collected.
nvprof and Visual Profiler have a hardcoded definition. FMA counts as 2 operations. All other operations are 1 operation. The flops_sp_* counters are thread instruction execution counts whereas flops_sp is the weighted sum so some weighting can be applied using the individual metrics. However, flops_sp_special covers a number of different instructions.
The Nsight VSE experiment configuration allows the user to define the operations per instruction type.
Nsight Visual Studio Edition
Configuring to collect Achieved FLOPS
Execute the menu command Nsight > Start Performance Analysis... to open the Activity Editor
Set Activity Type to Profile CUDA Application
In Experiment Settings set Experiments to Run to Custom
In the Experiment List add Achieved FLOPS
In the middle pane select Achieved FLOPS
In the right pane you can customize the FLOPS per instruction executed. The default weighting is for FMA and RSQ to count as 2. In some cases I have seen RSQ as high as 5.
Run the Analysis Session.
Viewing Achieved FLOPS
In the nvreport open the CUDA Launches report page.
In the CUDA Launches page select a kernel.
In the report correlation pane (bottom left) select Achieved FLOPS
nvprof
Metrics Available (on a K20)
nvprof --query-metrics | grep flop
flops_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
flops_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
flops_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
flops_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
flops_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
Collection and Results
nvprof --devices 0 --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
==2452== NVPROF is profiling process 2452, command: matrixMul.exe
GPU Device 0: "Tesla K20c" with compute capability 3.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 6.18 GFlop/s, Time= 21.196 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==2452== Profiling application: matrixMul.exe
==2452== Profiling result:
==2452== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K20c (0)"
Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
301 flops_sp FLOPS(Single) 131072000 131072000 131072000
301 flops_sp_add FLOPS(Single Add) 0 0 0
301 flops_sp_mul FLOPS(Single Mul) 0 0 0
301 flops_sp_fma FLOPS(Single FMA) 65536000 65536000 65536000
NOTE: flops_sp = flops_sp_add + flops_sp_mul + flops_sp_special + (2 * flops_sp_fma) (approximately)
Visual Profiler
The Visual Profiler supports the metrics shown in the nvprof section above.