The following code operates on two std::vectors, v1 and v2, each containing multiple 128-element vectors. Loops over the outer vectors (using i1 and i2) contain an inner loop designed to limit the combinations of i1 and i2 for which further, complex processing is performed. Around 99.9% of the combinations are filtered out.
Unfortunately the filtering loop is a major bottleneck in my program - profiling shows that 26% of the entire run time is spent on the line if(a[k] + b[k] > LIMIT).
const vector<vector<uint16_t>> & v1 = ...
const vector<vector<uint16_t>> & v2 = ...
for(size_t i1 = 0; i1 < v1.size(); ++i1) { //v1.size() and v2.size() about 20000
for(size_t i2 = 0; i2 < v2.size(); ++i2) {
const vector<uint16_t> & a = v1[i1];
const vector<uint16_t> & b = v2[i2];
bool good = true;
for(std::size_t k = 0; k < 128; ++k) {
if(a[k] + b[k] > LIMIT) { //LIMIT is a const uint16_t: approx 16000
good = false;
break;
}
}
if(!good) continue;
// Further processing involving i1 and i2
}
}
I think the performance of this code could be improved by increasing memory locality, plus perhaps vectorizing. Any suggestions on how to do this, or on other improvements that could be made?
You could apply SIMD to the inner loop:
bool good = true;
for(std::size_t k = 0; k < 128; ++k) {
if(a[k] + b[k] > LIMIT) { //LIMIT is a const uint16_t: approx 16000
good = false;
break;
}
}
as follows:
#include <emmintrin.h> // SSE2 intrinsics
#include <limits.h> // SHRT_MIN
// ...
// some useful constants - declare these somewhere before the outermost loop
const __m128i vLIMIT = _mm_set1_epi16(LIMIT + SHRT_MIN); // signed version of LIMIT
const __m128i vOFFSET = _mm_set1_epi16(SHRT_MIN); // offset for uint16_t -> int16_t conversion
// ...
bool good = true;
for(std::size_t k = 0; k < 128; k += 8) {
__m128i v, va, vb; // iterate through a, b, 8 elements at a time
int mask;
va = _mm_loadu_si128((const __m128i *)&a[k]); // get 8 elements from a[k]
vb = _mm_loadu_si128((const __m128i *)&b[k]); // and 8 elements from b[k]
v = _mm_adds_epu16(va, vb); // add a and b with unsigned saturation, so sums above 65535 don't wrap
v = _mm_add_epi16(v, vOFFSET); // subtract 32768 to make signed
v = _mm_cmpgt_epi16(v, vLIMIT); // compare against LIMIT
mask = _mm_movemask_epi8(v); // get comparison results as 16 bit mask
if (mask != 0) { // if any value exceeded limit
good = false; // clear good flag and exit loop
break;
}
}
Warning: untested code - may need debugging, but the general approach should be sound.
You've got the most efficient access pattern for v1, but you are sequentially scanning through all of v2 for each iteration of the outer loop. This is very inefficient, because the v2 accesses will continually cause L2 (and probably also L3) cache misses.
A better access pattern is to increase the loop nesting, so that outer loops stride through v1 and v2, and inner loops process elements within a subsegment of both v1 and v2 that's small enough to fit in cache.
Basically, instead of
for(size_t i1 = 0; i1 < v1.size(); ++i1) { //v1.size() and v2.size() about 20000
for(size_t i2 = 0; i2 < v2.size(); ++i2) {
Do
for(size_t i2a = 0; i2a < v2.size(); i2a += 32) {
for(size_t i1 = 0; i1 < v1.size(); ++i1) {
for(size_t i2 = i2a; i2 < v2.size() && i2 < i2a + 32; ++i2) {
Or
size_t i2a = 0;
// handle complete blocks
for(; i2a < v2.size() - 31; i2a += 32) {
for(size_t i1 = 0; i1 < v1.size(); ++i1) {
for(size_t i2 = i2a; i2 < i2a + 32; ++i2) {
}
}
}
// handle leftover partial block
for(size_t i1 = 0; i1 < v1.size(); ++i1) {
for(size_t i2 = i2a; i2 < v2.size(); ++i2) {
}
}
This way, a chunk of 32 * 128 * sizeof (uint16_t) bytes, or 8kB, will be loaded from v2 into cache, and then reused 20,000 times.
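Putting it together, a minimal sketch of the blocked loops with the original filter left unchanged (untested; std::min needs <algorithm>):
const size_t blockSize = 32; // 32 * 128 * sizeof(uint16_t) = 8 kB of v2 per block
for(size_t i2a = 0; i2a < v2.size(); i2a += blockSize) {
    const size_t i2end = std::min(v2.size(), i2a + blockSize);
    for(size_t i1 = 0; i1 < v1.size(); ++i1) {
        const vector<uint16_t> & a = v1[i1];
        for(size_t i2 = i2a; i2 < i2end; ++i2) {
            const vector<uint16_t> & b = v2[i2];
            bool good = true;
            for(std::size_t k = 0; k < 128; ++k) {
                if(a[k] + b[k] > LIMIT) { good = false; break; }
            }
            if(!good) continue;
            // Further processing involving i1 and i2
        }
    }
}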
This improvement is orthogonal to SIMD (SSE) vectorization. It will interact with thread-based parallelism, but probably in a good way.
First, one simple optimization could be this, but the compiler might do it by itself, so I'm not sure how much it would improve things:
for(std::size_t k = 0; k < 128 && good; ++k)
{
good = a[k] + b[k] <= LIMIT;
}
Second, I think it could be better to keep the filter results in a second vector, because any processing involving i1 and i2 could evict the filter's data from the CPU cache.
Third, and this could be a major optimization: I think you can rewrite the second for loop as for(size_t i2 = i1; i2 < v2.size(); ++i2), since you are using a + operation on the a and b vectors, which is commutative, so the result for (i1, i2) will be the same as for (i2, i1).
For this you need v1 and v2 to hold the same data (not merely the same size). If they are different collections you need to write the iteration in a different way.
Fourth, as far as I can see you are processing two matrices; it would be better to keep each one as a single flat vector of elements rather than a vector of vectors.
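As an illustration only (not the original code), the flat layout might look like this, where v1_rows and v2_rows stand for the assumed number of 128-element rows:
// Flat storage: element k of row i1 lives at flat1[i1 * 128 + k].
std::vector<uint16_t> flat1(v1_rows * 128);
std::vector<uint16_t> flat2(v2_rows * 128);
// ... fill flat1 and flat2 ...
for(size_t i1 = 0; i1 < v1_rows; ++i1) {
    const uint16_t * a = &flat1[i1 * 128];
    for(size_t i2 = 0; i2 < v2_rows; ++i2) {
        const uint16_t * b = &flat2[i2 * 128];
        // same filtering loop as before, now over two fully contiguous buffers
    }
}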
Hope this helps.
Razvan.
A few suggestions:
As suggested in the comments, replace the inner 128-element vector with an array for better memory locality.
This code looks highly parallelizable; have you tried that? You could split the combinations for filtering across all available cores, then rebalance the collected work and split the processing across all cores.
I implemented a version using arrays for the inner 128 elements, PPL for parallelization (requires VS 2012 or higher) and a bit of SSE code for the filtering and got a pretty significant speedup. Depending on what exactly the 'further processing' involves there may be benefits to structuring things slightly differently (in this example I don't rebalance the work after filtering for example).
Update: I implemented the cache blocking suggested by Ben Voigt and got a bit more of a speed up.
#include <algorithm> // std::equal, std::generate, std::min
#include <vector>
#include <array>
#include <random>
#include <limits>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <ppl.h>
#include <immintrin.h>
using namespace std;
using namespace concurrency;
namespace {
const int outerVecSize = 20000;
const int innerVecSize = 128;
const int LIMIT = 16000;
auto engine = default_random_engine();
};
typedef vector<uint16_t> InnerVec;
typedef array<uint16_t, innerVecSize> InnerArr;
template <typename Cont> void randomFill(Cont& c) {
// We want approx 0.1% to pass filter, the mean and standard deviation are chosen to get close to that
static auto dist = normal_distribution<>(LIMIT / 4.0, LIMIT / 4.6);
generate(begin(c), end(c), [] {
auto clamp = [](double x, double minimum, double maximum) { return min(max(minimum, x), maximum); };
return static_cast<uint16_t>(clamp(dist(engine), 0.0, numeric_limits<uint16_t>::max()));
});
}
void resizeInner(InnerVec& v) { v.resize(innerVecSize); }
void resizeInner(InnerArr& a) {}
template <typename Inner> Inner generateRandomInner() {
auto inner = Inner();
resizeInner(inner);
randomFill(inner);
return inner;
}
template <typename Inner> vector<Inner> generateRandomInput() {
auto outer = vector<Inner>(outerVecSize);
generate(begin(outer), end(outer), generateRandomInner<Inner>);
return outer;
}
void Report(const chrono::high_resolution_clock::duration elapsed, size_t in1Size, size_t in2Size,
const int passedFilter, const uint32_t specialValue) {
cout << passedFilter << "/" << in1Size* in2Size << " ("
<< 100.0 * (double(passedFilter) / double(in1Size * in2Size)) << "%) passed filter\n";
cout << specialValue << "\n";
cout << "Elapsed time = " << chrono::duration_cast<chrono::milliseconds>(elapsed).count() << "ms" << endl;
}
void TestOriginalVersion() {
cout << __FUNCTION__ << endl;
engine.seed();
const auto v1 = generateRandomInput<InnerVec>();
const auto v2 = generateRandomInput<InnerVec>();
int passedFilter = 0;
uint32_t specialValue = 0;
auto startTime = chrono::high_resolution_clock::now();
for (size_t i1 = 0; i1 < v1.size(); ++i1) { // v1.size() and v2.size() about 20000
for (size_t i2 = 0; i2 < v2.size(); ++i2) {
const vector<uint16_t>& a = v1[i1];
const vector<uint16_t>& b = v2[i2];
bool good = true;
for (std::size_t k = 0; k < 128; ++k) {
if (static_cast<int>(a[k]) + static_cast<int>(b[k])
> LIMIT) { // LIMIT is a const uint16_t: approx 16000
good = false;
break;
}
}
if (!good) continue;
// Further processing involving i1 and i2
++passedFilter;
specialValue += inner_product(begin(a), end(a), begin(b), 0);
}
}
auto endTime = chrono::high_resolution_clock::now();
Report(endTime - startTime, v1.size(), v2.size(), passedFilter, specialValue);
}
bool needsProcessing(const InnerArr& a, const InnerArr& b) {
static_assert(sizeof(a) == sizeof(b) && (sizeof(a) % 16) == 0, "Array size must be multiple of 16 bytes.");
static const __m128i mmLimit = _mm_set1_epi16(LIMIT);
static const __m128i mmLimitPlus1 = _mm_set1_epi16(LIMIT + 1);
static const __m128i mmOnes = _mm_set1_epi16(-1);
auto to_m128i = [](const uint16_t* p) { return reinterpret_cast<const __m128i*>(p); };
return equal(to_m128i(a.data()), to_m128i(a.data() + a.size()), to_m128i(b.data()), [&](const __m128i& a, const __m128i& b) {
// avoid overflow due to signed compare by clamping sum to LIMIT + 1
const __m128i clampSum = _mm_min_epu16(_mm_adds_epu16(a, b), mmLimitPlus1);
return _mm_test_all_zeros(_mm_cmpgt_epi16(clampSum, mmLimit), mmOnes);
});
}
void TestArrayParallelVersion() {
cout << __FUNCTION__ << endl;
engine.seed();
const auto v1 = generateRandomInput<InnerArr>();
const auto v2 = generateRandomInput<InnerArr>();
combinable<int> passedFilterCombinable;
combinable<uint32_t> specialValueCombinable;
auto startTime = chrono::high_resolution_clock::now();
const size_t blockSize = 64;
parallel_for(size_t(0), v1.size(), blockSize, [&](size_t i) {
for (const auto& b : v2) {
const auto blockBegin = begin(v1) + i;
const auto blockEnd = begin(v1) + min(v1.size(), i + blockSize);
for (auto it = blockBegin; it != blockEnd; ++it) {
const InnerArr& a = *it;
if (!needsProcessing(a, b))
continue;
// Further processing involving a and b
++passedFilterCombinable.local();
specialValueCombinable.local() += inner_product(begin(a), end(a), begin(b), 0);
}
}
});
auto passedFilter = passedFilterCombinable.combine(plus<int>());
auto specialValue = specialValueCombinable.combine(plus<uint32_t>());
auto endTime = chrono::high_resolution_clock::now();
Report(endTime - startTime, v1.size(), v2.size(), passedFilter, specialValue);
}
int main() {
TestOriginalVersion();
TestArrayParallelVersion();
}
On my 8-core system I see a pretty good speedup; your results will vary depending on how many cores you have, etc.
TestOriginalVersion
441579/400000000 (0.110395%) passed filter
2447300015
Elapsed time = 12525ms
TestArrayParallelVersion
441579/400000000 (0.110395%) passed filter
2447300015
Elapsed time = 657ms
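For reference, here is a rough, untested sketch of the rebalancing variant mentioned above: filter first, collecting the surviving (i1, i2) pairs, then split the expensive processing evenly across cores in a second pass. It reuses needsProcessing and the types from the code above; the process callback is a stand-in for the real work (needs <utility> for std::pair).
template <typename ProcessFn>
void FilterThenProcess(const vector<InnerArr>& v1, const vector<InnerArr>& v2, ProcessFn process) {
    combinable<vector<pair<size_t, size_t>>> survivorsLocal;
    parallel_for(size_t(0), v1.size(), [&](size_t i1) {
        for (size_t i2 = 0; i2 < v2.size(); ++i2)
            if (needsProcessing(v1[i1], v2[i2]))
                survivorsLocal.local().emplace_back(i1, i2);
    });
    vector<pair<size_t, size_t>> survivors;
    survivorsLocal.combine_each([&](const vector<pair<size_t, size_t>>& part) {
        survivors.insert(survivors.end(), part.begin(), part.end());
    });
    // Second pass: the surviving pairs are now spread evenly across cores.
    parallel_for_each(begin(survivors), end(survivors), [&](const pair<size_t, size_t>& p) {
        process(v1[p.first], v2[p.second]);
    });
}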
Related
This is my first time using multi-threading to speed up a heavy calculation.
Background: The idea is to calculate a Kernel Covariance matrix, by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculations by only calculating the lower triangular matrix. Since all the calculations are independent of each other, I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the calculation of the matrix entries row-wise, assigning a range of rows to each thread.
On a single core the calculations took about 280 seconds each time, on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, 8192);
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
Locking due to simultaneous access to the data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know it should prevent simultaneous data access, since every thread has its own copy.
Output -> every thread writes its part of the lower triangular matrix to its own file. Task Manager doesn't indicate anything close to full SSD utilization.
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
Okay, I've written an implementation with optimized formatting.
Using @Nelfeal's code, the run took around 250 seconds on my system, with the write time dominating by far - or rather, the std::ofstream formatting taking most of the time.
I've written a C++20 version using std::format_to/std::format. It is a multi-threaded version that takes around 25-40 seconds to complete all the computations, formatting, and writing. Run in a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the fmt library on C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
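For pre-C++20 toolchains, a minimal sketch of the same inner formatting done with the {fmt} library (untested; assumes a std::vector<char> buffer and a non-empty row, as above):
#include <fmt/format.h>
#include <iterator> // std::back_inserter
#include <vector>

// Appends one CSV row to the buffer, 6 significant digits per value.
void appendRow(std::vector<char>& buffer, const std::vector<double>& row)
{
    for (std::size_t j = 0; j + 1 < row.size(); ++j)
        fmt::format_to(std::back_inserter(buffer), "{:.6g}, ", row[j]);
    fmt::format_to(std::back_inserter(buffer), "{:.6g}\n", row.back());
}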
Since C++17 the standard library supports parallel algorithms, so I thought they would be the go-to option for us, but after comparing with TBB and OpenMP I changed my mind: I found the standard library to be much slower.
With this post, I want to ask for professional advice on whether I should abandon the standard library's parallel algorithms and use TBB or OpenMP instead. Thanks!
Env:
Mac OSX, Catalina 10.15.7
GNU g++-10
Benchmark code:
#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <tbb/parallel_for.h>
#include <numeric> // std::iota
#include <vector>
const size_t N = 1000000;
double std_for() {
auto values = std::vector<double>(N);
size_t n_par = 5lu;
auto indices = std::vector<size_t>(n_par);
std::iota(indices.begin(), indices.end(), 0lu);
size_t stride = static_cast<size_t>(N / n_par) + 1;
std::for_each(
std::execution::par,
indices.begin(),
indices.end(),
[&](size_t index) {
size_t begin = index * stride;
size_t end = std::min((index + 1) * stride, N); // clamp the last chunk so it doesn't run past the end
for (size_t i = begin; i < end; ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values)
{
total += value;
}
return total;
}
double tbb_for() {
auto values = std::vector<double>(N);
tbb::parallel_for(
tbb::blocked_range<int>(0, values.size()),
[&](tbb::blocked_range<int> r) {
for (int i=r.begin(); i<r.end(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double omp_for()
{
auto values = std::vector<double>(N);
#pragma omp parallel for
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double seq_for()
{
auto values = std::vector<double>(N);
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
void time_it(double(*fn_ptr)(), const std::string& fn_name) {
auto t1 = std::chrono::high_resolution_clock::now();
auto rez = fn_ptr();
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}
int main(int argc, char** argv) {
std::string op(argv[1]);
if (op == "std_for") {
time_it(&std_for, op);
} else if (op == "omp_for") {
time_it(&omp_for, op);
} else if (op == "tbb_for") {
time_it(&tbb_for, op);
} else if (op == "seq_for") {
time_it(&seq_for, op);
}
}
Compile options:
g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp
Results:
std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910
We can see that std_for is faster than seq_for (the sequential for-loop), but it's still much slower than TBB and OpenMP.
UPDATE
As people suggested in comments, I run each for separately to be fair. The above code is updated, and results as follows,
>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885
>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619
>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052
>>> ./a.out std_for
std_for, rez = 500106, dur = 12423
And as people said, running the four versions in a row (as in the previous results) is not a fair comparison.
You already found that it matters what exactly is to be measured and how this is done. Your final task will certainly be quite different from this simple exercise and will not entirely reflect the results found here.
Besides caching and warm-up effects, which are influenced by the order in which the tasks are run (you studied this explicitly in your updated question), there is another issue in your example that you should consider.
The actual parallel code is what matters. If it does not determine your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization and the final computation. If those drive the real costs in your final application, then again, parallelization is not the silver bullet. Thus, for a fair comparison, and to really measure the actual parallel code execution performance, I suggest modifying your code along these lines (sorry, I don't have OpenMP installed) and continuing your studies:
#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <tbb/parallel_for.h>
#include <numeric> // std::iota
#include <vector>
const size_t N = 10000000; // #1
void std_for(std::vector<double>& values,
std::vector<size_t> const& indices,
size_t const stride) {
std::for_each(
std::execution::par,
indices.begin(),
indices.end(),
[&](size_t index) {
size_t begin = index * stride;
size_t end = std::min((index + 1) * stride, values.size()); // clamp the last chunk so it doesn't run past the end
for (size_t i = begin; i < end; ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
}
void tbb_for(std::vector<double>& values) {
tbb::parallel_for(
tbb::blocked_range<int>(0, values.size()),
[&](tbb::blocked_range<int> r) {
for (int i=r.begin(); i<r.end(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
}
/*
double omp_for()
{
auto values = std::vector<double>(N);
#pragma omp parallel for
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
*/
void seq_for(std::vector<double>& values)
{
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
}
void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
std::vector<double> values = std::vector<double>(N);
auto t1 = std::chrono::high_resolution_clock::now();
fn_ptr(values);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
double total = 0;
for (double value : values) {
total += value;
}
std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}
void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
std::vector<double> values = std::vector<double>(N);
size_t n_par = 5lu; // #2
auto indices = std::vector<size_t>(n_par);
std::iota(indices.begin(), indices.end(), 0lu);
size_t stride = static_cast<size_t>(N / n_par) + 1;
auto t1 = std::chrono::high_resolution_clock::now();
fn_ptr(values, indices, stride);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
double total = 0;
for (double value : values) {
total += value;
}
std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}
int main(int argc, char** argv) {
std::string op(argv[1]);
if (op == "std_for") {
time_it_std(&std_for, op);
// } else if (op == "omp_for") {
//time_it(&omp_for, op);
} else if (op == "tbb_for") {
time_it(&tbb_for, op);
} else if (op == "seq_for") {
time_it(&seq_for, op);
}
}
On my (slow) system this results in:
std_for, res = 5.00046e+06, dur = 66393
tbb_for, res = 5.00046e+06, dur = 51746
seq_for, res = 5.00046e+06, dur = 196156
I note here that the difference from seq_for to tbb_for has further increased. It is now ~4x while in your example it looks more like ~3x. And std_for is still about 20..30% slower than tbb_for.
However, there are further parameters. After increasing N (see #1) by a factor of 10 (ok, this is not very important) and n_par (see #2) from 5 to 100 (this is important) the results are
tbb_for, res = 5.00005e+07, dur = 486179
std_for, res = 5.00005e+07, dur = 479306
Here std_for is on-par with tbb_for!
Thus, to answer your question: I clearly would NOT discard c++17 std parallelization right away.
Perhaps you already know this, but something I don't see mentioned here is the fact that (at least for GCC and Clang) the PSTL is actually implemented using, i.e. backed by, TBB, OpenMP (currently on Clang only, I believe), or a sequential version of it.
I'm guessing you're using libc++ since you are on Mac; as far as I know, for Linux at least, the LLVM distributions do not come with the PSTL enabled, and if building PSTL and libcxx/libcxxabi from source, it defaults to a sequential backend.
https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h
OpenMP is good for straightforward parallel coding.
On the other hand, TBB uses a work-stealing mechanism, which can give you better performance for loops that are imbalanced or nested.
I prefer TBB over OpenMP for complex and nested parallelism. (OpenMP has a huge overhead for nested parallelism.)
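As an illustration only (my own minimal example, not from the question), this is the kind of imbalanced loop where work-stealing pays off, since each row does a different amount of work:
#include <tbb/parallel_for.h>
#include <vector>
#include <cstddef>

// Lower-triangular workload: row i performs i + 1 inner iterations, so a static
// split of rows across threads would be unbalanced; TBB steals work to compensate.
void fillLowerTriangle(std::vector<double>& m, std::size_t n) {
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) {
        for (std::size_t j = 0; j <= i; ++j)
            m[i * n + j] = static_cast<double>(i) * static_cast<double>(j);
    });
}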
I have a std::vector<std::vector<double>> that I am trying to convert to a single contiguous vector as fast as possible. My vector has a shape of roughly 4000 x 50.
The problem is, sometimes I need my output vector in column-major contiguous order (just concatenating the interior vectors of my 2d input vector), and sometimes I need my output vector in row-major contiguous order, effectively requiring a transpose.
I have found that a naive for loop is quite fast for conversion to a column-major vector:
auto to_dense_column_major_naive(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
for (size_t i = 0; i < n_col; ++i)
for (size_t j = 0; j < n_row; ++j)
out_vec[i * n_row + j] = vec[i][j];
return out_vec;
}
But obviously a similar approach is very slow for row-wise conversion, because of all of the cache misses. So for row-wise conversion, I thought a blocking strategy to promote cache locality might be my best bet:
auto to_dense_row_major_blocking(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
size_t block_side = 8;
for (size_t l = 0; l < n_col; l += block_side) {
for (size_t k = 0; k < n_row; k += block_side) {
for (size_t j = l; j < l + block_side && j < n_col; ++j) {
auto const &column = vec[j];
for (size_t i = k; i < k + block_side && i < n_row; ++i)
out_vec[i * n_col + j] = column[i];
}
}
}
return out_vec;
}
This is considerably faster than a naive loop for row-major conversion, but still almost an order of magnitude slower than naive column-major looping on my input size.
My question is: is there a faster approach to converting a (column-major) vector of vectors of doubles to a single contiguous row-major vector? I am struggling to reason about what the speed limit of this code should be, and am thus questioning whether I'm missing something obvious. My assumption was that blocking would give me a much larger speedup than it appears to actually give.
The chart was generated using QuickBench (and somewhat verified with GBench locally on my machine) with this code: (Clang 7, C++20, -O3)
auto to_dense_column_major_naive(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
for (size_t i = 0; i < n_col; ++i)
for (size_t j = 0; j < n_row; ++j)
out_vec[i * n_row + j] = vec[i][j];
return out_vec;
}
auto to_dense_row_major_naive(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
for (size_t i = 0; i < n_col; ++i)
for (size_t j = 0; j < n_row; ++j)
out_vec[j * n_col + i] = vec[i][j];
return out_vec;
}
auto to_dense_row_major_blocking(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
size_t block_side = 8;
for (size_t l = 0; l < n_col; l += block_side) {
for (size_t k = 0; k < n_row; k += block_side) {
for (size_t j = l; j < l + block_side && j < n_col; ++j) {
auto const &column = vec[j];
for (size_t i = k; i < k + block_side && i < n_row; ++i)
out_vec[i * n_col + j] = column[i];
}
}
}
return out_vec;
}
auto to_dense_column_major_blocking(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
size_t block_side = 8;
for (size_t l = 0; l < n_col; l += block_side) {
for (size_t k = 0; k < n_row; k += block_side) {
for (size_t j = l; j < l + block_side && j < n_col; ++j) {
auto const &column = vec[j];
for (size_t i = k; i < k + block_side && i < n_row; ++i)
out_vec[j * n_row + i] = column[i];
}
}
}
return out_vec;
}
auto make_vecvec() -> std::vector<std::vector<double>>
{
std::vector<std::vector<double>> vecvec(50, std::vector<double>(4000));
std::mt19937 mersenne {2019};
std::uniform_real_distribution<double> dist(-1000, 1000);
for (auto &vec: vecvec)
for (auto &val: vec)
val = dist(mersenne);
return vecvec;
}
static void NaiveColumnMajor(benchmark::State& state) {
// Code before the loop is not measured
auto vecvec = make_vecvec();
for (auto _ : state) {
benchmark::DoNotOptimize(to_dense_column_major_naive(vecvec));
}
}
BENCHMARK(NaiveColumnMajor);
static void NaiveRowMajor(benchmark::State& state) {
// Code before the loop is not measured
auto vecvec = make_vecvec();
for (auto _ : state) {
benchmark::DoNotOptimize(to_dense_row_major_naive(vecvec));
}
}
BENCHMARK(NaiveRowMajor);
static void BlockingRowMajor(benchmark::State& state) {
// Code before the loop is not measured
auto vecvec = make_vecvec();
for (auto _ : state) {
benchmark::DoNotOptimize(to_dense_row_major_blocking(vecvec));
}
}
BENCHMARK(BlockingRowMajor);
static void BlockingColumnMajor(benchmark::State& state) {
// Code before the loop is not measured
auto vecvec = make_vecvec();
for (auto _ : state) {
benchmark::DoNotOptimize(to_dense_column_major_blocking(vecvec));
}
}
BENCHMARK(BlockingColumnMajor);
First of all, I cringe whenever something is qualified as "obviously". That word is often used to cover up a shortcoming in one's deductions.
But obviously a similar approach is very slow for row-wise conversion, because of all of the cache misses.
I'm not sure which is supposed to be obvious: that the row-wise conversion would be slow, or that it's slow because of cache misses. In either case, I find it not obvious. After all, there are two caching considerations here, aren't there? One for reading and one for writing? Let's look at the code from the reading perspective:
row_major_naive
for (size_t i = 0; i < n_col; ++i)
for (size_t j = 0; j < n_row; ++j)
out_vec[j * n_col + i] = vec[i][j];
Successive reads from vec are reads of contiguous memory: vec[i][0] followed by vec[i][1], etc. Very good for caching. So... cache misses? Slow? :) Maybe not so obvious.
Still, there is something to be gleaned from this. The claim is wrong only in calling it "obvious". There are non-locality issues, but they occur on the writing end. (Successive writes are offset by the space for 50 double values.) And empirical testing confirms the slowness. So maybe a solution is to flip what is considered "obvious"?
row major flipped
for (size_t j = 0; j < n_row; ++j)
for (size_t i = 0; i < n_col; ++i)
out_vec[j * n_col + i] = vec[i][j];
All I did here was reverse the loops. Literally swap the order of those two lines of code then adjust the indentation. Now successive reads are potentially all over the place, as they read from different vectors. However, successive writes are now to contiguous blocks of memory. In one sense, we are in the same situation as before. But just like before, one should measure performance before assuming "fast" or "slow".
NaiveColumnMajor: 3.4 seconds
NaiveRowMajor: 7.7 seconds
FlippedRowMajor: 4.2 seconds
BlockingRowMajor: 4.4 seconds
BlockingColumnMajor: 3.9 seconds
Still slower than the naive column major conversion. However, this approach is not only faster than naive row major, but it's also faster than blocking row major. At least on my computer (using gcc -O3 and obviously :P iterating thousands of times). Mileage may vary. I don't know what the fancy profiling tools would say. The point is that sometimes simpler is better.
For funsies I did a test where the dimensions are swapped (changing from 50 vectors of 4000 elements to 4000 vectors of 50 elements). All methods got hurt this way, but "NaiveRowMajor" took the biggest hit. Worth noting is that "flipped row major" fell behind the blocking version. So, as one might expect, the best tool for the job depends on what exactly the job is.
NaiveColumnMajor: 3.7 seconds
NaiveRowMajor: 16 seconds
FlippedRowMajor: 5.6 seconds
BlockingRowMajor: 4.9 seconds
BlockingColumnMajor: 4.5 seconds
(By the way, I also tried the flipping trick on the blocking version. The change was small -- around 0.2 -- and opposite of flipping the naive version. That is, "flipped blocking" was slower than "blocking" for the question's 50-of-4000 vectors, but faster for my 4000-of-50 variant. Fine tuning might improve the results.)
Update: I did a little more testing with the flipping trick on the blocking version. This version has four loops, so "flipping" is not as straight-forward as when there are only two loops. It looks like swapping the order of the outer two loops is bad for performance, while swapping the inner two loops is good. (Initially, I had done both and gotten mixed results.) When I swapped just the inner loops, I measured 3.8 seconds (and 4.1 seconds in the 4000-of-50 scenario), making this the best row-major option in my tests.
row major hybrid
for (size_t l = 0; l < n_col; l += block_side)
for (size_t i = 0; i < n_row; ++i)
for (size_t j = l; j < l + block_side && j < n_col; ++j)
out_vec[i * n_col + j] = vec[j][i];
(After swapping the inner loops, I merged the middle loops.)
As for the theory behind this, I would guess that this amounts to trying to write one cache block at a time. Once a block is written, try to re-use vectors (the vec[j]) before they get ejected from the cache. After you exhaust those source vectors, move on to a new group of source vectors, again writing full blocks at a time.
I have just added parallel versions of two of the functions:
#include <ppl.h>
auto ppl_to_dense_column_major_naive(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
size_t vecLen = out_vec.size();
concurrency::parallel_for(size_t(0), vecLen, [&](size_t i)
{
size_t row = i / n_row;
size_t column = i % n_row;
out_vec[i] = vec[row][column];
});
return out_vec;
}
auto ppl_to_dense_row_major_naive(std::vector<std::vector<double>> const & vec)
-> std::vector<double>
{
auto n_col = vec.size();
auto n_row = vec[0].size();
std::vector<double> out_vec(n_col * n_row);
size_t vecLen = out_vec.size();
concurrency::parallel_for(size_t(0), vecLen, [&](size_t i)
{
size_t column = i / n_col;
size_t row = i % n_col;
out_vec[i] = vec[row][column];
});
return out_vec;
}
and additional benchmark code for all of them:
template< class _Fn, class ... Args >
auto callFncWithPerformance( std::string strFnName, _Fn toCall, Args&& ...args )
{
auto start = std::chrono::high_resolution_clock::now();
auto toRet = toCall( std::forward<Args>(args)... );
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start;
std::cout << strFnName << ": " << diff.count() << " s" << std::endl;
return toRet;
}
template< class _Fn, class ... Args >
auto second_callFncWithPerformance(_Fn toCall, Args&& ...args)
{
std::string strFnName(typeid(toCall).name());
auto start = std::chrono::high_resolution_clock::now();
auto toRet = toCall(std::forward<Args>(args)...);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start;
std::cout << strFnName << ": " << diff.count() << " s";
return toRet;
}
#define MAKEVEC( FN, ... ) callFncWithPerformance( std::string( #FN ) , FN , __VA_ARGS__ )
int main()
{
//prepare vector
auto vec = make_vecvec();
std::vector< double > vecs[]
{
std::vector<double>(MAKEVEC(to_dense_column_major_naive, vec)),
std::vector<double>(MAKEVEC(to_dense_row_major_naive, vec)),
std::vector<double>(MAKEVEC(ppl_to_dense_column_major_naive, vec)),
std::vector<double>(MAKEVEC(ppl_to_dense_row_major_naive, vec)),
std::vector<double>(MAKEVEC(to_dense_row_major_blocking, vec)),
std::vector<double>(MAKEVEC(to_dense_column_major_blocking, vec)),
};
//system("pause");
return 0;
}
And here are the results:
Debug x64
to_dense_column_major_naive: 0.166859 s
to_dense_row_major_naive: 0.192488 s
ppl_to_dense_column_major_naive: 0.0557423 s
ppl_to_dense_row_major_naive: 0.0514017 s
to_dense_column_major_blocking: 0.118465 s
to_dense_row_major_blocking: 0.117732 s
Debug x86
to_dense_column_major_naive: 0.15242 s
to_dense_row_major_naive: 0.158746 s
ppl_to_dense_column_major_naive: 0.0534966 s
ppl_to_dense_row_major_naive: 0.0484076 s
to_dense_column_major_blocking: 0.111217 s
to_dense_row_major_blocking: 0.107727 s
Release x64
to_dense_column_major_naive: 0.000874 s
to_dense_row_major_naive: 0.0011973 s
ppl_to_dense_column_major_naive: 0.0054639 s
ppl_to_dense_row_major_naive: 0.0012034 s
to_dense_column_major_blocking: 0.0008023 s
to_dense_row_major_blocking: 0.0010282 s
Release x86
to_dense_column_major_naive: 0.0007156 s
to_dense_row_major_naive: 0.0012538 s
ppl_to_dense_column_major_naive: 0.0053351 s
ppl_to_dense_row_major_naive: 0.0013022 s
to_dense_column_major_blocking: 0.0008761 s
to_dense_row_major_blocking: 0.0012404 s
You are quite right: this data set is too small to be worth parallelizing, and the work per element is too small as well.
Still, I will post it for someone else to reference these functions.
Similar to Fastest way to determine if an integer is between two integers (inclusive) with known sets of values, I wish to figure out whether some value (most likely a double-precision floating-point number) is between two other values (of the same type). The caveat is that I don't already know which value is larger than the other, and I'm trying to determine whether (and how) I should avoid using std::max/min. Here is some code I've tried to test this with already:
#include <algorithm> // std::min, std::max
#include <ctime>     // std::clock
#include <iostream>

inline bool inRangeMult(double p, double v1, double v2) {
return (p - v1) * (p - v2) <= 0;
}
inline bool inRangeMinMax(double p, double v1, double v2) {
return p <= std::max(v1, v2) && p >= std::min(v1, v2);
}
inline bool inRangeComp(double p, double v1, double v2) {
return p < v1 != p < v2;
}
int main()
{
double a = 1e4;
std::clock_t start;
double duration;
bool res = false;
start = std::clock();
for (size_t i = 0; i < 2e4; ++i) {
for (size_t j = 0; j < 2e4; ++j) {
res = inRangeMult(a, i, j) ? res : !res;
}
}
duration = std::clock() - start;
std::cout << "InRangeMult: " << duration << std::endl;
start = std::clock();
for (size_t i = 0; i < 2e4; ++i) {
for (size_t j = 0; j < 2e4; ++j) {
res = inRangeMinMax(a, i, j) ? res : !res;
}
}
duration = std::clock() - start;
std::cout << "InRangeMinMax: " << duration << std::endl;
start = std::clock();
for (size_t i = 0; i < 2e4; ++i) {
for (size_t j = 0; j < 2e4; ++j) {
res = inRangeComp(a, i, j) ? res : !res;
}
}
duration = std::clock() - start;
std::cout << "InRangeComp: " << duration << std::endl;
std::cout << "Tricking the compiler by printing inane res: " << res << std::endl;
}
On most runs I'm finding that using std::min/max is still fastest (the latest run prints 346, 310, and 324 respectively), but I'm not 100% confident this is the best test setup, or that I've exhausted all of the reasonable implementations.
I'd appreciate anyone's input with a better profiling setup and/or better implementation.
EDIT: Updated code to make it less prone to compiler optimization.
2nd EDIT: Tweaked value of a and number of iterations. Results for one run are:
inRangeMult: 1337
inRangeMinMax: 1127
inRangeComp: 729
The first test:
(p - v1) * (p - v2) <= 0
may result in overflow or underflow due to the arithmetic operations.
The last one:
p < v1 != p < v2
doesn't provide the same results as the others, which are inclusive with respect to the boundaries v1 and v2. It's an admittedly small difference, considering the range and precision of the type double, but it may be significant.
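For a concrete boundary case (values chosen purely for illustration), take p == 2.0, v1 == 1.0 and v2 == 2.0:
inRangeMult(2.0, 1.0, 2.0);   // (2-1)*(2-2) = 0 <= 0            -> true  (boundary included)
inRangeMinMax(2.0, 1.0, 2.0); // 2 <= max(1,2) && 2 >= min(1,2)  -> true  (boundary included)
inRangeComp(2.0, 1.0, 2.0);   // (2 < 1) != (2 < 2)              -> false (boundary excluded)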
Another option is to explicitly expand the logic of the second function:
p <= std::max(v1, v2) && p >= std::min(v1, v2) // v1 and v2 are compared twice
Into something like this:
bool inRangeComp(double p, double v1, double v2) {
return v1 <= v2 // <- v1 and v2 are compared only once
? v1 <= p && p <= v2
: v2 <= p && p <= v1;
}
At least one compiler (gcc 8.2) seems to prefer this version over the alternatives (thanks to jarod42 for the linked snippet).
I have two vector<bool>s, A and B.
I want to compare them and count the number of elements that are equal:
For example:
A = {0,1,0,1}
B = {0,0,1,1}
The result will be equal to 2.
I can use _mm_cmpeq_epi8, but it only compares 16 elements at a time (i.e. I would have to convert the 0s and 1s to char and then do the comparison).
Is it possible to compare 128 elements each time with SSE (or SIMD instructions)?
If you can either assume that vector<bool> is using contiguous byte-sized elements for storage, or if you can consider using something like vector<uint8_t> instead, then this example should give you a good starting point:
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2);
vcount = _mm_sub_epi8(vcount, vcmp);
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
vcount = _mm_setzero_si128(); // update count from current block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
Note that this is using vector<uint8_t>. If you really have to use vector<bool> and can guarantee that the elements will always be contiguous and byte-sized then you'll just need to coerce the vector<bool> into a const uint8_t * or similar somehow.
Test harness:
#include <algorithm> // std::min
#include <cassert>
#include <cstdint> // uint8_t
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
#include <emmintrin.h> // SSE2
using std::vector;
static size_t count_equal_ref(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size());
const size_t n = vec1.size();
size_t i, count = 0;
for (i = 0 ; i < n; ++i)
{
count += (vec1[i] == vec2[i]);
}
return count;
}
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2);
vcount = _mm_sub_epi8(vcount, vcmp);
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
vcount = _mm_setzero_si128(); // update count from current block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
int main(int argc, char * argv[])
{
size_t n = 100;
if (argc > 1)
{
n = atoi(argv[1]);
}
vector<uint8_t> vec1(n);
vector<uint8_t> vec2(n);
srand((unsigned int)time(NULL));
for (size_t i = 0; i < n; ++i)
{
vec1[i] = rand() & 1;
vec2[i] = rand() & 1;
}
size_t n_ref = count_equal_ref(vec1, vec2);
size_t n_test = count_equal(vec1, vec2);
if (n_ref == n_test)
{
std::cout << "PASS" << std::endl;
}
else
{
std::cout << "FAIL: n_ref = " << n_ref << ", n_test = " << n_test << std::endl;
}
return 0;
}
Compile and run:
$ g++ -Wall -msse3 -O3 test.cpp && ./a.out
PASS
std::vector<bool> is a specialization of std::vector for the type bool. Although not specified by the C++ standard, in most implementations std::vector<bool> is made space-efficient such that each of its elements is a single bit instead of a bool.
The behaviour of std::vector<bool> is similar to its primary template counterpart, except that:
std::vector<bool> does not necessarily store its elements contiguously.
In order to expose its elements (i.e., the individual bits), std::vector<bool> uses a proxy class (i.e., std::vector<bool>::reference). Objects of class std::vector<bool>::reference are returned by the std::vector<bool> subscript operator (i.e., operator[]) by value.
Accordingly, I don't think it's portable to use _mm_cmpeq_epi8-like functions, since the storage of a std::vector<bool> is implementation-defined (i.e., not guaranteed to be contiguous).
An alternative but portable way is to use regular STL facilities like the example below:
std::vector<bool> A = {0,1,0,1};
std::vector<bool> B = {0,0,1,1};
std::vector<bool> C(A.size());
std::transform(A.begin(), A.end(), B.begin(), C.begin(), [](bool const &a, bool const &b) { return a == b;});
std::cout << std::count(C.begin(), C.end(), true) << std::endl;
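For what it's worth, a small variant (my own sketch, equivalent in result) avoids the temporary vector C by counting the matches directly:
std::size_t count = 0;
for (std::size_t i = 0; i < A.size(); ++i)
    count += (A[i] == B[i]);
std::cout << count << std::endl;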