fast compare array of unsigned numbers

fast compare array of unsigned numbers - c++

Consider this code:
constexpr size_t size = 32;
constexpr size_t count = 8;
using WordCode = unsigned;
template<typename T>
int CmpHashArray(const T *l,const T *r)
{
auto * l1 = reinterpret_cast<const __int32*>(l);
auto * r1 = reinterpret_cast<const __int32*>(r);
if(*l1 == *r1)
return 0;
if(*l1 < *r1)
return -1;
return 1;
}
int CmpHashArray2(const WordCode *l,const WordCode *r)
{
return memcmp(l, r, size);
}
int main(...)
{
WordCode a1[count], a2[count];
CmpHashArray(a1, a2);
CmpHashArray2(a1, a2);
}
is CmpHashArray have Undefined Behavior? Because with -O2 it takes 2 asm instructions instead of memcmp.
UPD:
Thanks for answer. As i see now, CmpHashArray can boil down to 1 compare if sizeof(array) <= 64bit
if this code can run faster memcmp?(on 64 and 32bit systems, crossplatform)
template<typename T,
size_t count,
typename std::enable_if<count*sizeof(T) % 64 == 0>::type
>
int CmpHashArray(const T *l,const T *r)
{
auto * l1 = reinterpret_cast<const __int64*>(l);
auto * r1 = reinterpret_cast<const __int64*>(r);
size_t iterCount = count*sizeof(T) / 64;
while(iterCount--) {
if(*l1 == *r1)
return 0;
if(*l1 < *r1)
return -1;
else
return 1;
++l1;
++r1;
}
}

Related

Convert array of bits to an array of bytes

I want to convert an array of bits (bool* bitArray) where the values are 1s and 0s into an array of bytes (unsigned char* byteArray) where the values at each index would be one byte.
For ex, index 0~7 in bitArray would go into byteArray[1].
How would I go about doing this? Assuming that I already have an array of bits (but the amount would be subject to change based on the incoming data).
I am not worried about having it divisible by 8 because I will just add padding at the end of the bitArray to make it divisible by 8.

Just just use bit shifts or a lookup array and and combine numbers with 1 bit set each with bitwise or for 8 bits at a time:
int main() {
bool input[] = {
false, false, false, true, true, true, false, false, false,
false, false, false, true, true, true, false, false, false,
false, false, false, true, true, true, false, false, false,
false, false, false, true, true, true, false, false, false,
};
constexpr auto len = sizeof(input) / sizeof(*input);
constexpr size_t outLen = ((len % 8 == 0) ? 0 : 1) + len / 8;
uint8_t out[outLen];
bool* inPos = input;
uint8_t* outPos = out;
size_t remaining = len;
// output bytes where there are all 8 bits available
for (; remaining >= 8; remaining -= 8, ++outPos)
{
uint8_t value = 0;
for (size_t i = 0; i != 8; ++i, ++inPos)
{
if (*inPos)
{
value |= (1 << (7 - i));
}
}
*outPos = value;
}
if (remaining != 0)
{
// output byte that requires padding
uint8_t value = 0;
for (size_t i = 0; i != remaining; ++i, ++inPos)
{
if (*inPos)
{
value |= (1 << (7 - i));
}
}
*outPos = value;
}
for (auto v : out)
{
std::cout << static_cast<int>(v) << '\n';
}
return 0;
}
The rhs of the |= operator could also be replaced with a lookup in the following array, if you consider this simpler to understand:
constexpr uint8_t Bits[8]
{
0b1000'0000,
0b0100'0000,
0b0010'0000,
0b0001'0000,
0b0000'1000,
0b0000'0100,
0b0000'0010,
0b0000'0001,
};
...
value |= Bits[i];
...

You should be using std::bitset for an array of bools, or std::vector<bool> if it's dynamically sized. And std::array for the array or again std::vector for dynamic size. I've only done static size below and conversion to and from.
Converting involves a lot of bit shifts and loops for something that should be a memcpy (on little endian or unsigned char types). The compiler output for -O2 is bad. -O3 removes the loop and to_array2 gets interesting. gcc nearly manages to optimize it, clang actually gets it down to movzx eax, word ptr [rdi]: https://godbolt.org/z/4chb8o81e
#include <array>
#include <bitset>
#include <climits>
template <typename T, std::size_t len>
constexpr std::bitset<sizeof(T) * CHAR_BIT * len> from_array(const std::array<T, len> &arr) {
std::bitset<sizeof(T) * CHAR_BIT * len> res;
std::size_t pos = 0;
for (auto x : arr) {
for(std::size_t i = 0; i < sizeof(T) * CHAR_BIT; ++i) {
res[pos++] = x & 1;
x >>= 1;
}
}
return res;
}
template <typename T, std::size_t len>
constexpr std::array<T, (len + sizeof(T) * CHAR_BIT - 1) / (sizeof(T) * CHAR_BIT)> to_array(const std::bitset<len> &bit) {
std::array<T, (len + sizeof(T) * CHAR_BIT - 1) / (sizeof(T) * CHAR_BIT)> res;
T mask = 1;
T t = 0;
std::size_t pos = 0;
for (std::size_t i = 0; i < len; ++i) {
if (bit[i]) t |= mask;
mask <<= 1;
if (mask == 0) {
mask = 1;
res[pos++] = t;
t = 0;
}
}
if constexpr (len % (sizeof(T) * CHAR_BIT) != 0) {
res[pos] = t;
}
return res;
}
std::bitset<16> from_array2(const std::array<unsigned char, 2> &arr) {
return from_array(arr);
}
std::array<unsigned short, 1> to_array2(const std::bitset<16> &bits) {
return to_array<unsigned short>(bits);
}
#include <iostream>
int main() {
std::array<unsigned char, 2> arr{0, 255};
std::bitset bits = from_array(arr);
std::cout << bits << std::endl;
std::bitset<16> bits2{0x1234};
std::array<unsigned short, 1> arr2 = to_array<unsigned short>(bits2);
std::cout << std::hex << arr2[0] << std::endl;
}

How to optimize my C++ OpenMp Matrix Multiplication code

I have written a C++ OpenMp Matrix Multiplication code that multiplies two 1000x1000 matrices.
So far I have gotten a 0.700 sec execution time using OpenMp but I want to see if there is other ways I can make it faster using OpenMp?
I appreciate any advice or tips you can give me.
Here is my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
void Multiply()
{
//initialize matrices with random numbers
int aMatrix[1000][1000], i, j;
for( i = 0; i < 1000; ++i)
{for( j = 0; j < 1000; ++j)
{aMatrix[i][j] = rand();}
}
int bMatrix[1000][1000], i1, j2;
for( i1 = 0; i1 < 1000; ++i1)
{for( j2 = 0; j2 < 1000; ++j2)
{bMatrix[i1][j2] = rand();}
}
//Result Matrix
int product[1000][1000] = {0};
for (int row = 0; row < 1000; row++) {
for (int col = 0; col < 1000; col++) {
// Multiply the row of A by the column of B to get the row, column of product.
for (int inner = 0; inner < 1000; inner++) {
product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
}
}
}
}
int main() {
time_t begin, end;
time(&begin);
Multiply();
time(&end);
time_t elapsed = end - begin;
cout << ("Time measured: %ld seconds.\n", elapsed);
return 0;
}

Following things can be done for speedup:
Using OpenMP for parallelizing external loop, like you did (and like I also did in my following code). Or alternatively using std::async for multi-threading like it was used in another answer.
Transpose B matrix, this will help to increase L1 cache hits, because you will read from sequential memory each B column (or row in transposed variant).
Use vectorized SIMD instructions, this will allow to do several multiplications (and additions) within one CPU cycle. Often compilers do auto-vectorization of your loops well enough through SIMD instructions without your help, but I did it explicitly in my code.
Run several independent SIMD instructions within loop. This will help to occupy whole CPU pipeline of SIMD. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling.
Align your matrix starting address on 64-bytes boundary. This will help SIMD instructions to do fast aligned read/write.
Align starting address of each matrix row on 64-bytes boundary. I did this in my code by padding each row with zeros till multiple of 64-bytes. This also helps SIMD instructions to do fast aligned read/write.
In my following code I did all 1. - 6. steps above. Memory 64-bytes alignment I did through implementing AlignmentAllocator that was used in std::vector. Also I did time measurements for float/double/int.
On my old 4-core laptop I got following time measurements for the case of 1000x1000 matrix multiplying by 1000x1000:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec
To compare my hardware capabilities I did measurements of another answer of #doug for the case of int:
Threads w transpose 0.2164 secs.
As one can see my solution is 1.4x times faster that the other answer, I guess due to memory 64-bytes alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).
To compile my program, don't forget to add -fopenmp -lgomp options (for OpenMP support) and -march=native -O3 -std=c++20 options (for SIMD support, optimizations and standard) if you're compiling under GCC/CLang, while MSVC I guess adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and standard in MSVC).
In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD, if you have more powerful machine you may tell me and I implement also FMA/AVX-512, they will give even twice more speed boost.
My Mul() function is quite generic, it is templated, and you just pass pointers to matrices and row/col count, so your matrices may be stored on calling side in any way (through std::vector or std::array or plain 2D array).
At start of Run() function you may change number of rows and columns if you need a bigger test. Notice that all my functions support any rows and columns, you may even multiply matrix of size 1234x2345 by 2345x3456.
Try it online!
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>
#include <immintrin.h>
#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if USE_OPENMP
#include <omp.h>
#endif
template <typename T, std::size_t N>
class AlignmentAllocator {
public:
typedef T value_type;
typedef std::size_t size_type;
typedef std::ptrdiff_t difference_type;
typedef T * pointer;
typedef const T * const_pointer;
typedef T & reference;
typedef const T & const_reference;
public:
inline AlignmentAllocator() throw() {}
template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
inline ~AlignmentAllocator() throw() {}
inline pointer adress(reference r) { return &r; }
inline const_pointer adress(const_reference r) const { return &r; }
inline pointer allocate(size_type n);
inline void deallocate(pointer p, size_type);
inline void construct(pointer p, const value_type & wert);
inline void destroy(pointer p) { p->~value_type(); }
inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};
template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#if IS_MSVC
auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
#else
auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
#endif
ASSERT(p);
return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#if IS_MSVC
_aligned_free(p);
#else
std::free(p);
#endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
new (p) value_type(wert);
}
template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;
template <typename T>
struct RegT;
#ifdef __AVX__
template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m256 & c) {
c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m256d & c) {
c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
}
inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m128 & c) {
c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m128d & c) {
c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
}
inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif
#ifdef __AVX2__
template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
// c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
//}
inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
// c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
//}
inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif
template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
ASSERT(A_cols == B_rows);
size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
// Copy aligned A
AlignedVector<T> Av(A_rows * A_cols_aligned);
for (size_t i = 0; i < A_rows; ++i)
std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
T const * A = Av.data();
// Transpose B
AlignedVector<T> Bv(B_cols * B_rows_aligned);
for (size_t j = 0; j < B_cols; ++j)
for (size_t i = 0; i < B_rows; ++i)
Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
T const * Bt = Bv.data();
ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
// Multiply
#pragma omp parallel for
for (size_t i = 0; i < A_rows; ++i) {
// Aligned Reg storage
AlignedVector<T> Regs(block);
for (size_t j = 0; j < B_cols; ++j) {
T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
using Reg = typename RegT<T>::type;
Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
size_t const k_hi = A_cols - A_cols % block;
for (size_t k = 0; k < k_hi; k += block) {
MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
}
StoreReg(&Regs[reg_cnt * 0], r0);
StoreReg(&Regs[reg_cnt * 1], r1);
StoreReg(&Regs[reg_cnt * 2], r2);
StoreReg(&Regs[reg_cnt * 3], r3);
T sum1 = 0, sum2 = 0, sum3 = 0;
for (size_t k = 0; k < Regs.size(); ++k)
sum1 += Regs[k];
//for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
for (size_t k = k_hi; k < A_cols; ++k)
sum2 += Arow[k] * Btrow[k];
C[i * A_rows + j] = sum2 + sum1;
}
}
}
#include <random>
#include <thread>
#include <chrono>
#include <type_traits>
template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
for (size_t i = 0; i < A_rows / 16; ++i)
for (size_t j = 0; j < B_cols / 16; ++j) {
T sum = 0;
for (size_t k = 0; k < A_cols; ++k)
sum += A[i * A_cols + k] * B[k * B_cols + j];
ASSERT_MSG(std::abs(C[i * A_rows + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
" C " + std::to_string(C[i * A_rows + j]) + " ref " + std::to_string(sum));
}
}
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
template <typename T>
void Run() {
size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
"double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
bool const is_int = tname == "int";
std::mt19937_64 rng{123};
std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
for (size_t i = 0; i < A.size(); ++i)
A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
for (size_t i = 0; i < B.size(); ++i)
B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
auto tim = Time();
Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
tim = Time() - tim;
std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
}
int main() {
try {
#if USE_OPENMP
omp_set_num_threads(std::thread::hardware_concurrency());
#endif
Run<float>();
Run<double>();
Run<int32_t>();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec

Here's straight c++ code that runs in .08s with ints and .14s with floats or doubles. My system is 10 years old with relatively slow memory. Good at the time but now is now.
I agree with #VictorEijkhout that the best results would be with tuned code. There has been huge amounts of work optimizing those.
#include <vector>
#include <array>
#include <random>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <thread>
#include <future>
#include <chrono>
struct Timer {
std::chrono::system_clock::time_point snapTime;
Timer() { snapTime = std::chrono::system_clock::now(); }
operator double() { return std::chrono::duration<double>(std::chrono::system_clock::now() - snapTime).count(); }
};
using DataType = int;
using std::array, std::vector;
constexpr int N = 1000, THREADS = 12;
static auto launchType = std::launch::async;
using Matrix = vector<array<DataType, N>>;
Matrix create_matrix() { return Matrix(N); };
Matrix product(Matrix const& v0, Matrix const& v1, double& time)
{
Matrix ret = create_matrix();
Matrix v2 = create_matrix();
Timer timer;
for (size_t r = 0; r < N; r++) // transpose first
for (size_t c = 0; c < N; c++)
v2[c][r] = v1[r][c];
// lambda to process sets of rows in separate threads
auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
for (size_t row = start; row < last; row++)
for (size_t col = 0; col < N; col++)
{
DataType tmp{}; // separate tmp variable significantly improves optimization
for (size_t col_t = 0; col_t < N; col_t++)
tmp += v0[row][col_t] * v2[col][col_t];
ret[row][col] = tmp;
}
};
vector<size_t> seq;
const size_t NN = N / THREADS;
// make a sequence of NN+1 rows from start to end
for (size_t thread_n = 0; thread_n < N; thread_n += NN)
seq.push_back(thread_n);
seq.push_back(N);
vector<std::future<void>> results; results.reserve(THREADS);
for (size_t i = 0; i < THREADS; i++)
results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
for (auto& x : results)
x.get();
time = timer;
return ret;
}
bool operator==(Matrix const& v0, Matrix const& v1)
{
for (size_t r = 0; r < N; r++)
for (size_t c = 0; c < N; c++)
if (v0[r][c] != v1[r][c])
return false;
return true;
}
int main()
{
auto fill = [](Matrix& v) {
std::mt19937_64 r(1);
std::normal_distribution dist(1.);
for (size_t row = 0; row < N; row++)
for (size_t col = 0; col < N; col++)
v[row][col] = DataType(dist(r));
};
Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix();
fill(m1); fill(m2);
auto process_test = [&m1, &m2](Matrix& out) {
const int rpt_count = 4;
double sum = 0;
for (int i = 0; i < rpt_count; i++)
{
double time;
out = product(m1, m2, time);
sum += time / rpt_count;
}
return sum;
};
std::cout << std::fixed << std::setprecision(4);
double time{};
time = process_test(m3);
std::cout << "Threads w transpose " << time << " secs.\n";
}

BigInt multiplication and to_string implementation outputs too many zeros

I created the following for multiplying two big integers stored with base 1,000,000,000 as a vector<int32_t>:
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
template<typename T>
constexpr T power_of_10(T n)
{
return n < 0 ? 0 : n == 0 ? 1 : (n == 1 ? 10 : 10 * power_of_10(n - 1));
}
template<typename T>
constexpr T base_value = power_of_10<T>(std::numeric_limits<T>::digits10);
template<typename T>
constexpr T max_value = base_value<T> - 1;
class BigInt {
private:
static constexpr const std::uint32_t base = base_value<std::uint32_t>;
static constexpr const std::uint32_t max_digits = std::numeric_limits<std::uint32_t>::digits10;
std::vector<std::uint64_t> digits;
public:
BigInt(const char* value) : BigInt(std::string(value))
{
}
BigInt(const std::string& value)
{
constexpr const int stride = std::numeric_limits<std::uint32_t>::digits10;
const std::size_t size = value.size() / stride;
for (std::size_t i = 0; i < size; ++i)
{
auto it = value.begin();
auto jt = value.begin();
std::advance(it, i * stride);
std::advance(jt, (i * stride) + stride);
digits.push_back(std::stoull(std::string(it, jt)));
}
if (value.size() % stride)
{
auto remainder = std::string(value.begin() + size * stride, value.end());
digits.push_back(std::stoull(remainder));
}
std::reverse(digits.begin(), digits.end());
}
BigInt& multiply(const BigInt& other)
{
std::vector<std::uint64_t> product = std::vector<std::uint64_t>(digits.size() + other.digits.size(), 0);
for (std::size_t i = 0; i < other.digits.size(); ++i)
{
std::uint64_t carry = 0, total = 0;
for (std::size_t j = 0; j < digits.size(); ++j)
{
total = product.at(i + j) + (other.digits[i] * digits[j]) + carry;
carry = total / base;
total %= base;
product.at(i + j) = total;
}
if (carry)
{
product[i + digits.size()] = carry;
}
}
digits = product;
return *this;
}
std::string to_string() {
std::string result = std::to_string(digits[digits.size() - 1]);
//
// for (std::int64_t i = digits.size() - 2; i >= 0; --i)
// {
// std::string group = std::to_string(digits[i]);
// while (group.size() < max_digits) {
// group = '0' + group;
// }
// result += group;
// }
for (std::int64_t i = digits.size() - 2; i >= 0; --i)
{
std::uint64_t value = digits[i];
std::uint32_t divisor = base;
while(divisor)
{
if (divisor != base)
{
result += (value / divisor) + '0';
}
value %= divisor;
divisor /= 10;
}
}
return result;
}
};
int main(int argc, const char * argv[])
{
BigInt a = "5000000000";
BigInt b = "5000000000";
std::cout<<a.multiply(b).to_string()<<"\n";
std::cout<<"25000000000000000000"<<"\n";
return 0;
}
When I print the result of the multiplication, I am getting 5,000,000,000 * 5,000,000,000 = 250,000,000,000,000,000,000,000,000,000,000,000 which has way too many zeroes!
It should have 18 zeroes, but mine has 34.
I believe my multiplication algorithm is correct and my to_string is incorrect because 500 * 500 prints correctly as 25,000.
Any ideas what is wrong?

The problem comes from this line:
product[digits.size() + 1] = static_cast<T>(carry);
The index digits.size() + 1 is incorrect. It should be digits.size() + j.

bool judgement is so slow? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I'm optimizing the function, I try every way and even sse, and modified code to return from different position to see the calculate timespan but finally I found most of the time spends on the bool judgement. Even I replace all code in the if statement with a simple add operation in it, it still cost 6000ms.
My platform is gcc 4.7.1 e5506 cpu. Its input 'a' and 'b' is a 1000size int array, and 'asize', 'bsize' are corresponding array size. MATCH_MASK = 16383, I run the function 100000 times to statistics a timespan. Is there any good idea to the problem. Thank you!
if (aoffsets[i] && boffsets[i]) // this line costs most time
Code:
uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
uint8_t* seen = (uint8_t *)aoffsets;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0); // it's the fastest way already, very near to tls
uint8_t* counts = &(count_vec[0]);
//return aoffsets[0]; // cost 1375 ms
for (int i = 0; i < MATCH_MASK; ++i)
{
if (aoffsets[i] && boffsets[i]) // this line costs most time
{
//++affsets[i]; // for test
int offset = (aoffsets[i] -= boffsets[i]);
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return aoffsets[0]; // cost 6000ms

First, memset(count_vec,0, N); of a memaligned buffer wins over std::vector by 30%.
You can try to use the branchless expression (aoffsets[i] * boffsets[i]) and calculate some of not-to-be-used expressions simultaneously: offset = aoffset[i]-boffset[i]; offset+bsize; offset+n_maxoffset;.
Depending on the typical range of offset, one could be tempted to calculate the min/max of (offset+bsize) to restrict the needed memset(count_vec) at the next iteration: no need to clear already zero values.
As pointed by philipp, it's good to interleave the operations -- then again, one can read both aoffset[i] and boffset[i] simultaneously from uint32_t aboffset[N]; with some clever bit masking (that generates change mask for: aoffset[i], aoffset[i+1]) one could possibly handle 2 sets in parallel using 64-bit simulated SIMD in pure c (up to the histogram accumulation part).

You can increase the speed of your program by reducing the cache misses: aoffsets[i] and boffsets[i] are relatively far away from each other in memory. By placing them next to each other, you speed up the program significantly. On my machine (e5400 cpu, VS2012) the execution time is reduced from 3.0 seconds to 2.3 seconds:
#include <vector>
#include <windows.h>
#include <iostream>
typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14;
static const int MATCH_LEFT = (32 - MATCH_BITS);
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;
uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
uint16_t offsets[DOUBLE_MATCH_MASK] = {0};
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])*2 /*important. leave space for other offsets*/] = i;
};
fn_init_offsets(a, asize, offsets);
fn_init_offsets(b, bsize, offsets+1);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0);
uint8_t* counts = &(count_vec[0]);
for (int i = 0; i < MATCH_MASK; i+=2)
{
if (offsets[i] && offsets[i+1])
{
int offset = (offsets[i] - offsets[i+1]); //NOTE: I removed
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return offsets[0];
}
int main(int argc, char* argv[])
{
const int sizes = 1000;
int32_t* a = new int32_t[sizes];
int32_t* b = new int32_t[sizes];
for (int i=0;i<sizes;i++)
{
a[i] = rand()*rand();
b[i] = rand()*rand();
}
//Variablen
LONGLONG g_Frequency, g_CurentCount, g_LastCount;
QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount);
int sum = 0;
for (int i=0;i<100000;i++)
{
sum += test(a,sizes,b,sizes);
}
QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount);
double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency));
std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl;
delete[] a;
delete[] b;
return 0;
}
compared to your version of test().
#include <vector>
#include <windows.h>
#include <iostream>
typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14;
static const int MATCH_LEFT = (32 - MATCH_BITS);
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;
uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0);
uint8_t* counts = &(count_vec[0]);
for (int i = 0; i < MATCH_MASK; ++i)
{
if (aoffsets[i] && boffsets[i])
{
int offset = (aoffsets[i] - boffsets[i]); //NOTE: I removed the -= because otherwise offset would always be positive!
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return aoffsets[0];
}
int main(int argc, char* argv[])
{
const int sizes = 1000;
int32_t* a = new int32_t[sizes];
int32_t* b = new int32_t[sizes];
for (int i=0;i<sizes;i++)
{
a[i] = rand()*rand();
b[i] = rand()*rand();
}
LONGLONG g_Frequency, g_CurentCount, g_LastCount;
QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount);
int sum = 0;
for (int i=0;i<100000;i++)
{
sum += test(a,sizes,b,sizes);
}
QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount);
double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency));
std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl;
delete[] a;
delete[] b;
return 0;
}

Binary logic emulator logic error

If you need any more information just ask.
What I'm trying to do is emulate boolean logic that could be found on a computer in C++ code. Right now I'm trying to create a 32 bit adder. When I run the testing code, I get an output of 32, which is wrong it should be 64. I'm fairly sure that my add function is correct. The code for those gates is:
bool and(bool a, bool b)
{
return nand(nand(a,b),nand(a,b));
}
bool or(bool a, bool b)
{
return nand(nand(a,a),nand(b,b));
}
bool nor(bool a, bool b)
{
return nand(nand(nand(a,a), nand(b,b)),nand(nand(a,a), nand(b,b)));
}
The code for the add function:
bool *add(bool a, bool b, bool carry)
{
static bool out[2];
out[0] = nor(nor(a, b), carry);
out[1] = or(and(b,carry),and(a,b));
return out;
}
bool *add32(bool a[32], bool b[32], bool carry)
{
static bool out[33];
bool *tout;
for(int i = 0; i < 32; i++)
{
tout = add(a[i], b[i], (i==0)?false:tout[1]);
out[i] = tout[0];
}
out[32] = tout[1];
return out;
}
The code I'm using to test this is:
bool *a = int32tobinary(32);
bool *b = int32tobinary(32);
bool *c = add32(a, b, false);
__int32 i = binarytoint32(c);
Those two functions are:
bool *int32tobinary(__int32 a)
{
static bool _out[32];
bool *out = _out;
int i;
for(i = 31; i >= 0; i--)
{
out[i] = (a&1) ? true : false;
a >>= 1;
}
return out;
}
__int32 binarytoint32(bool b[32])
{
int result = 0;
int i;
for(i = 0; i < 32; i++)
{
if(b[i] == true)
result += (int)pow(2.0f, 32 - i - 1);
}
return result;
}

Where to start?
As noted in the comments, returning a pointer to a static variable is wrong.
This
out[0] = nor(nor(a, b), carry);
should be
out[0] = xor(xor(a, b), carry);
This out[1] = or(and(b,carry),and(a,b)); is also incorrect. out[1] must be true, when a == true and carry == true.
add32 assumes index 0 to be the LSB, int32tobinary and int32tobinary assume index 0 to be MSB.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

fast compare array of unsigned numbers - c++

Related

Convert array of bits to an array of bytes

How to optimize my C++ OpenMp Matrix Multiplication code

BigInt multiplication and to_string implementation outputs too many zeros

bool judgement is so slow? [closed]

Binary logic emulator logic error

Categories

Resources