Parallelizing a for loop gives no performance gain

Parallelizing a for loop gives no performance gain - c++

I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.
However, when I tried parallelizing the main forwith MS's cuncurrency::parallel_for I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for gave near exact results.
For comparison, I started out with this algorithm implemented in C# in which I also used parallel_for loops and there I encountered near X4 performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}

First of all, your algorithm is memory bandwidth bounded. That is memory load/store would outweigh any index calculations you do.
Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.
Increasing work amount per iteration is also useless - both PPL and TBB are smart enough, to not create thread per iteration, they would use some good partition, which would additionaly try to preserve locality. For instance, here is quote from TBB::parallel_for:
When worker threads are available, parallel_for executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.
You are fully traversing input and output data. Even if you skip something in output - that doesn't mater, because memory operations are happening by 64 byte chunks at modern hardware. So, calculate size of your input and output, measure time of algorithm, divide size/time and compare result with maximal characteristics of your system (for instance, measure with benchmark).
I have made test for Microsoft PPL, OpenMP and Native for, results are (I used 8x of your height):
Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
If remove memset then:
Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
As you can see memset (which is highly optimized) is responsoble for significant amount of execution time, which shows how your algorithm is memory bounded.
FULL SOURCE CODE:
#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>
using namespace boost;
using namespace std;
const auto Width = 3264;
const auto Height = 2540*8;
struct MS_PPL_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
concurrency::parallel_for(first,last,f);
}
};
struct Intel_TBB_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
tbb::parallel_for(first,last,f);
}
};
struct Native_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
for(; first!=last; ++first) f(first);
}
};
struct OpenMP_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
#pragma omp parallel for
for(auto i=first; i<last; ++i) f(i);
}
};
template<typename T>
struct ConvertBayerToRgbImageAsIs
{
const T* BayerChannel;
T* RgbChannel;
template<typename For>
void operator()(For for_)
{
cout << type_name<For>() << "\t";
progress_timer t;
int offsets[] = {2,1,1,0};
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
for_(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row % 2)*2 + (col % 2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
};
int main()
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
for(auto i=0;i!=4;++i)
{
mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
cout << string(16,'_') << endl;
}
}

Synchronization overhead
I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and ran the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that would case less iterations and more work per iteration. The reasoning behind this is that there is too much synchronization done.
Cache usage
An important factor may be how the data is split (partitioned) for the processing. If the proceessed rows are separated as in the bad case below, then more rows will cause a cache miss. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any results
bad good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
Also make sure that you access your data in the same way it is aligned; it is possible that each call to offset[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.
Code optimizations
the optimizations shown below may be done by the compiler and may not give better results. It is worth knowing that they can be done.
// is the memset really necessary?
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
int rowMod = (row & 1) << 1;
for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
{
auto rgbIndex = tripleBayerIndex + offsets[rowMod];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
//unrolled the loop to save col & 1 operation
rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
}
});

Here comes my suggestion:
Computer larger chunks in parallel
get rid of modulo/multiplication
unroll inner loop to compute one full pixel (simplifies code)
template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
// convert BGGR->RGB
// have as many threads as the hardware concurrency is
parallel_for(0, Height, static_cast<int>(Height/(thread::hardware_concurrency())), [&] (int stride)
{
for (auto row = stride; row<2*stride; row++)
{
for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
}
});
}
This code is 60% faster than the original version but still only half as fast as the non parallelized version on my laptop. This seemed to be due to the memory boundedness of the algorithm as others have pointed out already.
edit: But I was not happy with that. I could greatly improve the parallel performance when going from parallel_for to std::async:
int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
res[i].wait();
}
delete [] res;
with converter being a simple class:
template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) :
BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
// convert BGGR->RGB
for(int row = startRow; row < endRow; row++)
{
for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
};
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};
This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.

I have not used tbb::parallel_for not cuncurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advice you to run more that 10 iterations when testing, and also be sure to do as many warmup iterations before timing.
I tested your code exactly using three different methods, averaged over 1000 tries.
Serial: 14.6 += 1.0 ms
std::async: 13.6 += 1.6 ms
workers: 11.8 += 1.2 ms
The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.
The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro, with dual hyper threaded cores = 4 logical cores.
For reference, here's my std::async parallel for:
template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
auto N = std::thread::hardware_concurrency();
std::vector<std::future<void>> futures;
for (Int ti=0; ti<N; ++ti) {
Int b = ti * (end - beg) / N;
Int e = (ti+1) * (end - beg) / N;
if (ti == N-1) { e = end; }
futures.emplace_back( std::async([&,b,e]() {
for (Int ix=b; ix<e; ++ix) {
fun( ix );
}
}));
}
for (auto&& f : futures) {
f.wait();
}
}

Things to check or do
Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM card installed or two that are on the same channel, you're getting half the available bandwidth.
How are you timing your code?
The timing should be done inside the application like Evgeny Panasyuk suggests.
You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
Remove the superfluous memset, as others have explained.
As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
Make sure your machine is otherwise idle when doing performance testing.
Additional timings
I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:
#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>
using namespace std;
const int Width = 3264;
const int Height = 2540*8;
class Timer {
private:
string name;
LARGE_INTEGER start;
LARGE_INTEGER stop;
LARGE_INTEGER frequency;
public:
Timer(const char *name) : name(name) {
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);
}
~Timer() {
QueryPerformanceCounter(&stop);
LARGE_INTEGER time;
time.QuadPart = stop.QuadPart - start.QuadPart;
double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
printf("%-20s : %5.2f\n", name.c_str(), elapsed);
}
};
static const int offsets[] = {2,1,1,0};
template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = 0, bayerIndex = row * Width;
col < Width; col++, bayerIndex++)
{
int offset = (row % 2)*2 + (col % 2); //0...3
int rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
}
// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = row*Width, rgbCol =row*Width;
col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
}
int _tmain(int argc, _TCHAR* argv[])
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
for(int i = 0; i < 4; ++i)
{
{
Timer t("serial_orig");
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_orig");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_orig");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("serial_unrolled");
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_unrolled");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_unrolled");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
printf("-----------------------------\n");
}
return 0;
}
Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:
serial_orig : 0.13
omp_dynamic_orig : 0.10
omp_static_orig : 0.10
serial_unrolled : 0.06
omp_dynamic_unrolled : 0.04
omp_static_unrolled : 0.04
The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.

The performance reduction might be happening because your are trying to distribute for loop on "row" number of cores, which wont be available and hence again it become like a sequential execution with the overhead of parallelism.

Not very familiar with parallel for loops but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.
Can you break up your array access into 4k chunks somewhat align with the page boundary?

There is no point talking about parallel performance before not having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to obtain similarly optimized versions, but I'd rather not rely on that)
parallel_for(0, Height, [=] (int row) noexcept
{
for (auto col=0, bayerindex=row*Width,
rgb0=3*bayerindex+offset[(row%2)*2],
rgb1=3*bayerindex+offset[(row%2)*2+1];
col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
{
RgbChannel[rgb0] = BayerChannel[bayerindex ];
RgbChannel[rgb1] = BayerChannel[bayerindex+1];
}
});

Related

Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100`000`000;
const int N_BINS = 16;
double obs[N_OBS]; // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.

Modern CPUs are surprisingly good running your naïve version. On AMD Zen3, I’m getting 48ms for 100M random numbers on input, that’s 18 GB/sec RAM read bandwidth. That’s like 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
No SIMD gonna help, I’m afraid. Still, the best version I got is the following. Compile with OpenMP support, the switch depends on your C++ compiler.
void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
// Count of OpenMP threads = CPU cores to use
constexpr int ompThreadsCount = 4;
// Use independent set of accumulators per thread, otherwise concurrency gonna corrupt data.
// Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
memset( &accumulators, 0, sizeof( accumulators ) );
// Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
for( int i = 0; i < ompThreadsCount; i++ )
{
// Grab a slice of the output buffer
double* const acc = &accumulators[ i * 16 ];
// Compute a slice of the source data for this thread
const size_t first = i * length / ompThreadsCount;
const size_t last = ( i + 1 ) * length / ompThreadsCount;
// Accumulate into thread-local portion of the buffer
for( size_t i = first; i < last; i++ )
{
const int8_t idx = indices[ i ];
acc[ idx ] += rsi[ i ];
}
}
// Reduce 16*N scalars to 16 with a few AVX instructions
for( int i = 0; i < 16; i += 4 )
{
__m256d v = _mm256_load_pd( &accumulators[ i ] );
for( int j = 1; j < ompThreadsCount; j++ )
{
__m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
v = _mm256_add_pd( v, v2 );
}
_mm256_storeu_pd( rdi + i, v );
}
}
The above version results in 20.5ms time, translates to 88% of RAM bandwidth limit.
P.S. I have no idea why the optimal threads count is 4 here, I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.

If indeed the offsets do not change for thousands (probably even tens) of times, it is likely worthwile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product with the matrix in compressed-column-storage format (without explicitly storing the 1-values).
As shown in this answer sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
using sparse_matrix = std::vector<std::vector<int> >;
// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
sparse_matrix res;
res.resize(n_bins);
// count entries for each bin:
for(int i=0; i<n_obs; ++i) {
// assert(offsets[i] < n_bins);
res[offsets[i]].push_back(i);
}
return res;
}
void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
for(std::size_t row=0; row<indexes.size(); ++row) {
double sum = 0;
for(int col : indexes[row]) {
// you can manually vectorize this using _mm256_i32gather_pd,
// but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
sum += obs[col];
}
acc[row] = sum;
}
}

CNTK Low GPU utilisation training via C++ API

I've written some fairly standard CNTK code to train a simple feed forward network using the C++ API. During training I get a speed up of about 4 times using GPU over CPU (CPU is 100% utilised) but that is with virtually zero GPU utilisation during training when I use GPU. Using the CNTK C++ APIs how can one increase GPU utilisation? My results are fine but seemingly too slow.
I'm running CNTK on a recent Lenovo, Windows 10, Intel CPU, Quadro P3200.
I'm applying CNTK to in-memory data generated on the fly in my program.
Simple dataset with 2 scalar features and 1 scalar label data, say 10,000 items. The data is actually generated from Monte Carlo simulations.
As there is no C++ reader API for in-memory data I'm using the Value::CreateSequence API to create data to pass to the trainer. Note the data is created on the card, read only (assuming 'device' parameter is GPU(0)) to reduce host/card transfers. The data structures are originally C++ eigen matrices. I'm using Nesterov learner in the code sample but have tried SGD and SGD with momentum with similar results. All calcs are using float which should be fast on the Quadro.
void RMLearning::OLSDeepLearning(const MatrixXd& X, const MatrixXd& Y, const DeviceDescriptor& device)
{
auto inputVarName = L"features";
auto inputVar = InputVariable({ m_inputDim }, DataType::Float, inputVarName);
auto regressionOutput = FullyConnectedFeedForwardRegressionNet(inputVar, m_outputDim, m_hiddenLayersDim, m_numHiddenLayers, device, m_nonLinearity, L"regressionOutput");
auto labelsVarName = L"Labels";
auto labelsVar = InputVariable({ m_outputDim }, DataType::Float, labelsVarName);
auto trainingLoss = ReduceSum(CNTK::SquaredError(regressionOutput, labelsVar, L"SquaredErrorLossFunction"), Axis::AllAxes(), L"SquaredErrorLossFunction");
if (m_SaveAndReLoadModel)
SaveReloadESModel(regressionOutput, trainingLoss, inputVar, labelsVar, device, inputVarName, labelsVarName);
m_prediction = regressionOutput;
ProgressWriterPtr pw = MakeSharedObject<MyProgressWriter>(0, 0, 0, 0, 0, 0);
// Nesterov learner (SGD with momentum)
m_learner = MomentumSGDLearner(regressionOutput->Parameters(), m_learningRate, m_Momentum, true);
m_trainer = CreateTrainer(regressionOutput, trainingLoss, m_prediction, { m_learner }, { pw });
size_t numSamples = X.rows();
m_inputData.resize(m_inputDim * numSamples);
m_labelData.resize(m_outputDim * numSamples, 0);
m_olsFittedValues.resize(numSamples);
flattenXYData(numSamples, X, Y);
NDShape inputShape({ m_inputDim });
auto dim = inputShape.Dimensions();
ValuePtr inputValue = Value::CreateSequence(inputShape, m_inputData, device,true);
NDShape labelShape({ m_outputDim });
ValuePtr labelValue = Value::CreateSequence(labelShape, m_labelData, device,true);
ValuePtr outputValue, predictionErrorValue;
//main training loop
//at the moment CNTK C++ API lacks "reader" functions for taking data from memory.
//the reader support the automatic minibatch extraction from full batch.
//Here we are training full batch.
if (m_printTrainingProgress) std::cout << "OLS learning..." << std::endl;
for (size_t i = 0; i < m_iterationCount; ++i)
{
m_trainer->TrainMinibatch({ {inputVar, inputValue}, {labelsVar, labelValue} }, device);
if (m_printTrainingProgress) m_trainer->SummarizeTrainingProgress();
}
//record fitted training weights
getTrainingWeightsVaR();
//get fitted values and VaR weights
EvaluationSequenceUsingDense(m_trainer->EvaluationFunction(), m_inputData, m_olsFittedValues, false, device);
if (m_printTrainingSummary)
{
std::unordered_map<Variable, ValuePtr> tmap = { {inputVar, inputValue} };
m_trainer->TestMinibatch(tmap, device, false);
m_trainer->SummarizeTestProgress();
}
}
void RMLearning::flattenXYData(const size_t &numSamples, const Eigen::MatrixXd & X, const Eigen::MatrixXd & Y)
{
//map X matrix to input data vector
size_t j = 0;
for (size_t i1 = 0; i1 < numSamples; ++i1)
{
for (size_t i2 = 0; i2 < m_inputDim; ++i2)
{
m_inputData[j] = (float)X(i1, i2);
j++;
}
}
//map Y matrix to label data vector
j = 0;
for (size_t i1 = 0; i1 < numSamples; ++i1)
{
for (size_t i2 = 0; i2 < m_outputDim; ++i2)
{
m_labelData[j] = (float)Y(i1, i2);
j++;
}
}
}
//Neural network structure setup
inline FunctionPtr RMLearning::FullyConnectedFeedForwardRegressionNet(Variable input,
size_t outputLayerDim,
size_t hiddenLayerDim,
size_t numHiddenLayers,
const DeviceDescriptor& device,
const std::function<FunctionPtr(const FunctionPtr&)>& nonLinearity,
const std::wstring& outputName,
unsigned long seed)
{
assert(numHiddenLayers >= 1);
auto regressionRoot = FullyConnectedDNNLayer(input, hiddenLayerDim, device, nonLinearity, L"", seed);
for (size_t i = 1; i < numHiddenLayers; ++i)
regressionRoot = FullyConnectedDNNLayer(regressionRoot, hiddenLayerDim, device, nonLinearity, L"", seed);
//output layer
regressionRoot = FullyConnectedLinearLayer(regressionRoot, outputLayerDim, device, outputName, seed);
return regressionRoot;
}
With NN of say 3 hidden layers of dim 20, training full batches of say 20,000 records on 1,000 iterations takes about 2 seconds using GPU (with nearly zero utilisation) and 8 seconds using CPU (6 core Intel 100% utilised). Given that CNTK shields the developer from most of the low level detail of CUDA does it provide any API or parameters that may be used to increase utilisation? In my situation is the GPU likely waiting on CPU for something? What strategies have others found useful?

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously
transpose the matrix M into a matrix M'
popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one ?
Note that N could be quite large (say 1024 and more) regarding 64 bits architecture.

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than #edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, like about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet but a front-end decode bottleneck would explain it). For some reason #edrezen's version is only running at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running 3.9GHz, but Agner Fog's library detects it at 4.0, but that's not enough to explain it.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
Your data layout isn't clear about when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory. i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 will be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Accumulators narrower than 1 byte within a SIMD vector is basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration:
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.

I made some benchmark between the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a nbRowsx32 matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
{
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
{
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
}
for (size_t row=0; row<32; row++)
{
// We popcount the current row
u_int8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
for (uint64_t row=0; row<nbRows; row++)
{
for (size_t col=0; col<32; col++)
{
globalSums[col] += (bitmap[row] >> col) & 1;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
{
// We transpose the input as a 32x32 bytes matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
{
globalSums[i] += __builtin_popcount (transpo[i]);
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
}
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
#endif
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
{
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
{
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
);
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
);
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
);
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
);
#else
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
#endif
// we reset the local totals
localSums = _mm256_setzero_si256();
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
{
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
}
#endif
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
{
UpdateLocalSums<1>(localSums, bitmap, k);
}
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
check,
nbRows * cycleTotal / timeTotal / 1000000.0
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
// We fill the bitmap with random values
srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
return EXIT_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}
////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
{
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
}
}
////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used the Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
------------------------------------------------------------------------------------------------------------
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no more effort to optimize the AVX2 "transpose+popcount" method, which should be feasable by computing several popcounts in parallel (I may test it later).

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////
template<int DEPTH>
struct RecursiveSWAR
{
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
{
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
{
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
{
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
accumulators[2*j+0],
_mm256_and_si256 (
accumulatorsMinusOne[j],
mask
)
);
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
accumulators[2*j+1],
_mm256_and_si256 (
_mm256_srli_epi32 (
accumulatorsMinusOne[j],
RecursiveSWAR<DEPTH-1>::N
),
mask
)
);
}
}
}
};
// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
{
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
}
};
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
unsigned char reversebits (unsigned char b)
{
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
{
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
{
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
{
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
{
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
}
}
}
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check/nbRuns
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
return EXIT_SUCCESS;
}
The size of the accumulators is 2DEPTH in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU caches properties can explain this behaviour.

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
Non parallel function (calib(cv::Mat) -an object of calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling -according to >> this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
for(int i = range.start; i < range.end; i++)
{
// divide image in 'thr' number of parts and process simultaneously
cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
cv::Mat in = img(roi);
cv::Mat out = retVal(roi);
out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
}
I though that running my function in parallel for subimages/tiles/ROIs is not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
cv::Mat& image; //source and result image (to be overwritten)
bool cont; //if the image is continuous
size_t rows;
size_t cols;
size_t threads;
std::vector<cv::Range> ranges;
pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
: image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
{
int groupSize = 1;
if (cont) {
cols *= rows;
rows = 1;
groupSize = ceil( cols / threads );
}
else {
groupSize = ceil( rows / threads );
}
int t = 0;
for(t=0; t<threads-1; ++t) {
ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
}
ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
}
virtual void operator()(const cv::Range& range) const
{
for(int r = range.start; r < range.end; r++)
{
T* Ip = nullptr;
cv::Range ran = ranges.at(r);
if(cont) {
Ip = image.ptr<T>(0);
for (int j = ran.start; j < ran.end; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
else {
for(int i = ran.start; i < ran.end; ++i)
{
Ip = image.ptr<T>(i);
for (int j = 0; j < cols; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
}
}
}
};
Then I run it on 1280x1024 64FC1 image, on i5 processor, Win8, and get the time in range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why is my implementation so much slower than iterating all the pixels in subimages... Is there a bug in my code or the OpenCV ROIs are optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!

Generally it's really hard to say why using cv::parallel_for failed to speed up whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago i tried to optimize this algorithm and i noticed strange thing - first time i use it, it takes x ms, but if use use it second, third, ... time (of course without restarting application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's causes by branch prediction - when code is executed first time branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take(and usually the guess will be correct). You can read more about it here - it's really good question and it can open your eyes for some quite important thing.
So, in your situation i would try few things:
measure it many times - 100 or 1000 should be enough (if it takes 0.12-0.4s it won't take much time) and see whether the last version of you code still is the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
ParallelPixelLoop loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
test it on bigger image. Maybe in your situation you just "don't need" 4 cores, but on bigger image 4 cores will make positive difference.
Use profiler (for example Very Sleepy) to see what part of your code is critical

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods performing different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
int noOfOccurrences = 0;
foreach (float y in largeFloatingPointArray) {
if (x == y) {
noOfOccurrences++;
}
}
noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?

UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
float y; // compute one (or more) floats
int noOfOccurrences = 0;
int a;
while( x < size ) // While there is work to do each thread will:
{
dictionary[x] = 0; // Initialize the position in each it will work
noOfOccurrences = 0;
for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
{ // that are equal
// to it assign float
y = largeFloatingPointArray[j]; // Take a candidate from the floats array
y *= 10000; // e.g if y = 0.0001f;
a = y + 0.5; // a = 1 + 0.5 = 1;
if (a == x) noOfOccurrences++;
}
dictionary[x] += noOfOccurrences; // Update in the dictionary
// the number of times that the float appears
x += blockDim.x * gridDim.x; // Update the position here the thread will work
}
}
This one I just tested for smaller inputs, because I am testing in my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just did this naive version that executes your algorithm for an array with 30,000,000 element in less than 20 seconds (including the time taken by function that generates the data).
This naive version first sorts your array of floats. Afterward, will go through the sorted array and check the number of times a given value appears in the array and then puts this value in a dictionary along with the number of times it has appeared.
You can use sorted map, instead of the unordered_map that I used.
Heres the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>
typedef std::tr1::unordered_map<float, int> Mymap;
void generator(float *data, long int size)
{
float LO = 0.0;
float HI = 100.0;
for(long int i = 0; i < size; i++)
data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}
void print_array(float *data, long int size)
{
for(long int i = 2; i < size; i++)
printf("%f\n",data[i]);
}
std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
float previous = data[0];
int count = 1;
std::tr1::unordered_map<float, int> dict;
for(long int i = 1; i < size; i++)
{
if(previous == data[i])
count++;
else
{
dict.insert(Mymap::value_type(previous,count));
previous = data[i];
count = 1;
}
}
dict.insert(Mymap::value_type(previous,count)); // add the last member
return dict;
}
void printMAP(std::tr1::unordered_map<float, int> dict)
{
for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
{
std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
}
}
int main(int argc, char** argv)
{
int size = 1000000;
if(argc > 1) size = atoi(argv[1]);
printf("Size = %d",size);
float data[size];
using namespace __gnu_cxx;
std::tr1::unordered_map<float, int> dict;
generator(data,size);
sort(data, data + size);
dict = fill_dict(data,size);
return 0;
}
If you have the library thrust installed in you machine your should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entirely Molecular Dynamic simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameter 6 array each one with 500,000 doubles, for a total of 3 Millions doubles (22 MB).
So if you are planning to put 30 Million floating points, which is about 114 MB of global Memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with the Molecular Dynamic (MD), I would say no. The sequential MD version takes about 25 hours to complete while the GPU version took 45 Minutes. You said your application took a couple hours, also based in your code example it looks softer than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of a code in CUDA could be the sum of two 2D arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
__global__ add(int *c, int *a, int*b, int N)
{
int pos = (threadIdx.x + blockIdx.x)
for(; i < N; pos +=blockDim.x)
c[pos] = a[pos] + b[pos];
}
In CUDA you basically took each for iteration and assigned to each thread,
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N the number maximum of blocks) and each block has a 'X' number of threads with an ID from 0 to X-1.
Gives you the for loop iteration that each thread will compute based on its ID and the block ID which the thread is in; the blockDim.x is the number of threads that a block has.
So if you have 2 blocks each one with 10 threads and N=40, the:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ hash (float *largeFloatingPointArray, int *dictionary)
// You can turn the dictionary in one array of int
// here each position will represent the float
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x to different position
// in the dictionary:
// pos 0 have the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f ect.
// Then you use the int of each position
// to count how many times that "float" had appeared
int x = blockIdx.x; // Each block will take a different x to work
float y;
while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
int noOfOccurrences = 0;
float z = converting_int_to_float(x); // This function will convert the x to the
// float like you use (x / 0.0001)
// each thread of each block
// will takes the y from the array of largeFloatingPointArray
for(j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
{
y = largeFloatingPointArray[j];
if (z == y)
{
noOfOccurrences++;
}
}
if(threadIdx.x == 0) // Thread master will update the values
atomicAdd(&dictionary[x], noOfOccurrences);
__syncthreads();
}
You have to use atomicAdd because different threads from different blocks may write/read noOfOccurrences concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farmer is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and anothers:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look on the last item, you will find many link to learn CUDA.
OpenCL: OpenCL Tutorials | MacResearch

I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
if (math.Truncate(x/0.0001f)*0.0001f == x)
{
if (noOfNumbers.ContainsKey(x))
noOfNumbers.Add(x, noOfNumbers[x]+1);
else
noOfNumbers.Add(x, 1);
}
}
Hope this helps.

Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.
CUDA is mature with a lot of tools but is NVidia GPUs centric.
OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there is a lot of samples for histograms buiding with OpenCL (note that many are image histograms but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.

In addition to the suggestion by the above poster use the TPL (task parallel library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.Foreach and ConcurrentDictionary, but a more complex map-reduce setup where the array is split into chunks each generating an dictionary which would then be reduced to a single dictionary would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.

I am not sure whether using GPUs would be a good match given that
'largerFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
/*
* Convert the X range from 0f to 100f in steps of 0.0001f
* into a range of integers 0 to 1 + (100 * 10000) to use as an
* index into an array.
*/
#define X_MAX (1 + (100 * 10000))
/*
* Number of floats in largeFloatingPointArray needs to be defined
* below to be whatever your value is.
*/
#define LARGE_ARRAY_MAX (1000)
main()
{
int j, y, *noOfOccurances;
float *largeFloatingPointArray;
/*
* Allocate memory for largeFloatingPointArray and populate it.
*/
largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
if (largeFloatingPointArray == 0) {
printf("out of memory\n");
exit(1);
}
/*
* Allocate memory to hold noOfOccurances. The index/10000 is the
* the floating point number. The contents is the count.
*
* E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
* in largeFloatingPointArray.
*/
noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
if (noOfOccurances == 0) {
printf("out of memory\n");
exit(1);
}
for (j = 0; j < LARGE_ARRAY_MAX; j++) {
y = (int)(largeFloatingPointArray[j] * 10000);
if (y >= 0 && y <= X_MAX) {
noOfOccurances[y]++;
}
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js