Why does my code cause instruction-cache misses? - c++

According to cachegrind this checksum calculation routine is one of the greatest contributors to instruction-cache load and instruction-cache misses in the entire application:
#include <stdint.h>

namespace {

uint32_t OnesComplementSum(const uint16_t * b16, int len) {
    uint32_t sum = 0;
    uint32_t a = 0;
    uint32_t b = 0;
    uint32_t c = 0;
    uint32_t d = 0;
    // helper for the loop unrolling
    auto run8 = [&] {
        a += b16[0];
        b += b16[1];
        c += b16[2];
        d += b16[3];
        b16 += 4;
    };
    for (;;) {
        if (len > 32) {
            run8();
            run8();
            run8();
            run8();
            len -= 32;
            continue;
        }
        if (len > 8) {
            run8();
            len -= 8;
            continue;
        }
        break;
    }
    sum += (a + b) + (c + d);
    auto reduce = [&]() {
        sum = (sum & 0xFFFF) + (sum >> 16);
        if (sum > 0xFFFF) sum -= 0xFFFF;
    };
    reduce();
    while ((len -= 2) >= 0) sum += *b16++;
    if (len == -1) sum += *(const uint8_t *)b16; // add the last byte
    reduce();
    return sum;
}

} // anonymous namespace

uint32_t get(const uint16_t* data, int length)
{
    return OnesComplementSum(data, length);
}
See asm output here.
Maybe it's caused by the loop unrolling, but the generated object code doesn't seem excessive.
How can I improve the code?
Update
Because the checksum function was in an anonymous namespace, it was inlined and duplicated into two functions that resided in the same .cpp file.
The loop unrolling is still beneficial. Removing it slowed down the code.
Improving the infinite loop speeds up the code (but for some reason I get the opposite results on my Mac).
Before fixes: here you can see the two checksums and 17210 L1 IR misses
After fixes: after fixing the inlining problem and fixing the infinite loop the L1 instruction cache misses dropped to 8324.
"InstructionFetch" is higher in the fixed example. I'm not sure how to interpret that. Does it simply mean that's where most activity occurred? Or does it hint at a problem?

replace the main loop with just:
const int quick_len = len / 8;
const uint16_t * const the_end = b16 + quick_len * 4;
len -= quick_len * 8;
for (; b16 + 4 <= the_end; b16 += 4)
{
    a += b16[0];
    b += b16[1];
    c += b16[2];
    d += b16[3];
}
There seems to be no need to manually unroll the loop if you use -O3.
Also, the test case allowed for too much optimization, since the input was static and the results were unused. Printing the result also helps verify that optimized versions don't break anything.
Full test I used:
int main(int argc, char *argv[])
{
    using namespace std::chrono;
    auto start_time = steady_clock::now();
    int ret = OnesComplementSum((const uint8_t*)(s.data() + argc), s.size() - argc, 0);
    auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count();
    std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret << std::endl;
    return ret;
}
Comparison with this version (CLEAN LOOP), your improved version (UGLY LOOP), and a longer test string:
loop=CLEAN_LOOP elapsed_ns=8365 = 14031
loop=CLEAN_LOOP elapsed_ns=5793 = 14031
loop=CLEAN_LOOP elapsed_ns=5623 = 14031
loop=CLEAN_LOOP elapsed_ns=5585 = 14031
loop=UGLY_LOOP elapsed_ns=9365 = 14031
loop=UGLY_LOOP elapsed_ns=8957 = 14031
loop=UGLY_LOOP elapsed_ns=8877 = 14031
loop=UGLY_LOOP elapsed_ns=8873 = 14031
Verification here: http://coliru.stacked-crooked.com/a/52d670039de17943
EDIT:
In fact the whole function can be reduced to:
uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
{
    const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData);
    const uint16_t * const the_end = b16 + len / 2;
    for (; b16 < the_end; ++b16)
    {
        sum += *b16;
    }
    sum = (sum & uint16_t(-1)) + (sum >> 16);
    return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
}
Which does better than the OP's with -O3 but worse with -O2:
http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7
loop=CLEAN_LOOP elapsed_ns=5825 = 14031
loop=CLEAN_LOOP elapsed_ns=5717 = 14031
loop=CLEAN_LOOP elapsed_ns=5681 = 14031
loop=CLEAN_LOOP elapsed_ns=5646 = 14031
loop=UGLY_LOOP elapsed_ns=9201 = 14031
loop=UGLY_LOOP elapsed_ns=8826 = 14031
loop=UGLY_LOOP elapsed_ns=8859 = 14031
loop=UGLY_LOOP elapsed_ns=9582 = 14031
So mileage may vary, and unless the exact architecture is known, I'd just go with the simpler version.


C++ Index may have a value of '1000000' which is out of bounds

Intro
I do performance tests with the following array sizes 1,000,000, 10,000,000 and 100,000,000. My IDE (CLion) shows me the following warning message:
Index may have a value of '1000000' which is out of bounds
This warning appears only in the first for-loop in each line where bodies[i] is accessed:
void doPerformanceTestBodiesAsAoS(const size_t n) {
    auto *bodies = new Body<long double, double[3], double[3]>[n];
    for (size_t i = 0; i < n; ++i) {
        bodies[i].mass = static_cast<double>(i);
        bodies[i].position[0] = static_cast<double>(i);
        bodies[i].position[1] = static_cast<double>(i + 1);
        bodies[i].position[2] = static_cast<double>(i + 2);
        bodies[i].velocity[0] = static_cast<double>(i);
        bodies[i].velocity[1] = static_cast<double>(i + 1);
        bodies[i].velocity[2] = static_cast<double>(i + 2);
    }
    const auto startTimeInNanoSeconds = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < (n - 1); ++i) {
        const size_t nextBodyIndex = i + 1;
        std::sqrt(
            PhysicsEngine::math::pow2(bodies[i].position[0] - bodies[nextBodyIndex].position[0]) +
            PhysicsEngine::math::pow2(bodies[i].position[1] - bodies[nextBodyIndex].position[1]) +
            PhysicsEngine::math::pow2(bodies[i].position[2] - bodies[nextBodyIndex].position[2])
        );
    }
    auto endTimeInNanoSeconds = std::chrono::high_resolution_clock::now();
    std::cout.precision(17);
    std::cout << "Processing of AoS took: "
              << static_cast<long double>(std::chrono::duration_cast<std::chrono::nanoseconds>(endTimeInNanoSeconds - startTimeInNanoSeconds).count()) / 1e+9
              << " seconds for N = " << n << std::endl;
    delete[] bodies;
}
// ...
TEST(PerformanceTest, PerformanceTest1MioBodiesAsAoS) {
    doPerformanceTestBodiesAsAoS(1'000'000);
}
TEST(PerformanceTest, PerformanceTest10MioBodiesAsAoS) {
    doPerformanceTestBodiesAsAoS(10'000'000);
}
TEST(PerformanceTest, PerformanceTest100MioBodiesAsAoS) {
    doPerformanceTestBodiesAsAoS(100'000'000);
}
Problem
I'm now unsure whether this warning is a false positive (?!) or whether I'm missing something and thus possibly falsifying my performance tests.
What have I already done
I have checked that on my system the sizes 1,000,000 - 100,000,000 do not exceed INT_MAX, to make sure the indexing cannot overflow arithmetically. This shouldn't happen anyway, because int is 4 bytes and size_t is 8 bytes on my system.
I restarted my IDE several times, cleaned up caches etc.
I changed every size_t to int, because I assumed the index operator [] expects an int instead of a size_t, but that didn't resolve the warning.
I searched for similar questions but couldn't find any.
So I'm asking for a code review: is it a false positive, or do I have a bug?
Additional information
I am using Windows 10 64-bit, MinGW, and an Intel processor.
The typedef of size_t is:
#ifdef _WIN64
__MINGW_EXTENSION typedef unsigned __int64 size_t;
#else
typedef unsigned int size_t;
#endif /* _WIN64 */

How to quickly divide a big unsigned integer by a word?

I am currently developing a class to work with big unsigned integers. However, I only need partial functionality, namely:
bi_uint+=bi_uint - Already implemented. No complaints.
bi_uint*=std::uint_fast64_t - Already implemented. No complaints.
bi_uint/=std::uint_fast64_t - Implemented, but it works very slowly and also requires a type that is twice as wide as uint_fast64_t. In my test case, division was 35 times slower than multiplication.
Next, I will give my implementation of division, which is based on the simple long division algorithm:
#include <climits>
#include <cstdint>
#include <limits>
#include <vector>

class bi_uint
{
public:
    using u64_t = std::uint_fast64_t;
    constexpr static std::size_t u64_bits = CHAR_BIT * sizeof(u64_t);
    using u128_t = unsigned __int128;
    static_assert(sizeof(u128_t) >= sizeof(u64_t) * 2);

    //little-endian
    std::vector<u64_t> data;

    //User should guarantee data.size()>0 and val>0
    void self_div(const u64_t val)
    {
        auto it = data.rbegin();
        if(data.size() == 1) {
            *it /= val;
            return;
        }
        u128_t rem = 0;
        if(*it < val) {
            rem = *it++;
            data.pop_back();
        }
        u128_t r = rem % val;
        while(it != data.rend()) {
            rem = (r << u64_bits) + *it;
            const u128_t q = rem / val;
            r = rem % val;
            *it++ = static_cast<u64_t>(q);
        }
    }
};
You can see that the unsigned __int128 type was used; therefore, this option is not portable, is tied to a single compiler (GCC), and also requires an x64 platform.
After reading the page about division algorithms, I feel the appropriate algorithm would be Newton-Raphson division. However, the Newton-Raphson algorithm seems complicated to me. I suspect there is a simpler algorithm for "big_uint / uint" division that would have almost the same performance.
Q: How can I quickly divide a bi_uint by a u64_t?
I have about 10^6 iterations, and each iteration uses all the operations listed.
If this is easily achievable, then I would like to have portability and not use unsigned __int128. Otherwise, I prefer to abandon portability in favor of the easier way.
EDIT1:
This is an academic project, I am not able to use third-party libraries.
Part 1 (See Part 2 below)
I managed to speed up your division code 5x on my old laptop (and even 7.5x on the GodBolt servers) using Barrett reduction, a technique that replaces a single division with several multiplications and additions. I implemented the whole code from scratch just today.
If you want, you can jump directly to the code at the end of my post without reading the long description, as the code is fully runnable without any extra knowledge or dependencies.
The code below is only for Intel x64, because I used Intel-only instructions and only their 64-bit variants. It can certainly be rewritten for x32 and for other processors too, because the Barrett algorithm is generic.
To explain the whole Barrett reduction in short pseudo-code, I'll write it in Python, as it is the simplest language for understandable pseudo-code:
# https://www.nayuki.io/page/barrett-reduction-algorithm

def BarrettR(n, s):
    return (1 << s) // n

def BarrettDivMod(x, n, r, s):
    q = (x * r) >> s
    t = x - q * n
    return (q, t) if t < n else (q + 1, t - n)
In the pseudo-code above, BarrettR() is computed only once per divisor (you use the same single-word divisor for the whole big-integer division). BarrettDivMod() is used each time you need a division or modulus: given an input x and divisor n, it returns the tuple (x / n, x % n), nothing else, but it does so faster than a regular division instruction.
In the C++ code below I implement the same two Barrett functions, but with some C++-specific optimizations to make them even faster. The optimizations are possible because the divisor n is always 64-bit, and x is 128-bit but its higher half is always smaller than n (that last assumption holds because, in your big-integer division, the higher half is always a remainder modulo n).
The Barrett algorithm works only with a divisor n that is NOT a power of 2, so divisors like 1, 2, 4, 8, 16, ... are not allowed. You can cover this trivial case by a right bit-shift of the big integer, because dividing by a power of 2 is just a bit-shift. Any other divisor is allowed, including even divisors that are not powers of 2.
It is also important to note that my BarrettDivMod() accepts ONLY a dividend x that is strictly smaller than divisor * 2^64; in other words, the higher half of the 128-bit dividend x must be smaller than the divisor. This is always true in your big-integer division function, as the higher half is always a remainder modulo the divisor. This precondition on x is up to you to check; in my BarrettDivMod() it is verified only as a DEBUG assertion that is removed in release builds.
You can see that BarrettDivMod() has two big branches; these are two variants of the same algorithm. The first uses the Clang/GCC-only type unsigned __int128; the second uses only 64-bit instructions and is hence suitable for MSVC.
I tried to target three compilers, Clang/GCC/MSVC, but somehow the MSVC version got only 2x faster with Barrett, while Clang/GCC are 5x faster. Maybe I made a mistake in the MSVC code.
You can see that I used your class bi_uint for the time measurement of the two versions of the code - with regular division and with Barrett. Note that I changed your code quite significantly: first, to not use u128 (so that the MSVC version, which has no u128, compiles), and second, to not modify the data vector, so it does a read-only division and doesn't store the final result (this read-only behavior is needed so I can run the speed tests very fast without copying the data vector on each test iteration). So your code is broken in my snippet and can't be copy-pasted and used straight away; I only used it for the speed measurement.
Barrett reduction is faster not only because division is slower than multiplication, but also because multiplication and addition are both very well pipelined on modern CPUs: a modern CPU can execute several mul or add instructions within one cycle, but only if those instructions don't depend on each other's results; in other words, the CPU can run several instructions in parallel within one cycle. As far as I know, division can't be run in parallel, because there is only a single division unit in the CPU, but it is still slightly pipelined: after roughly 50% of a first division is done, a second division can be started at the beginning of the CPU pipeline.
On some computers you may notice that the regular Divide version is sometimes much slower. This happens because Clang/GCC fall back to a library-based soft implementation of division even for a 128-bit dividend. In that case my Barrett may show even a 7-10x speedup, as it doesn't use library functions.
To overcome the soft-division issue described above, it is better to add assembly code that uses the DIV instruction directly, or to find an intrinsic function in your compiler that does this (I think Clang/GCC have such an intrinsic). I can also write this assembly implementation if needed; just tell me in the comments.
Update. As promised, I implemented an assembly variant of 128-bit division for Clang/GCC, the function UDiv128Asm(). After this change it is used as the main implementation of 128-bit division for Clang/GCC instead of the regular u128(a) / b. You can go back to the regular u128 implementation by replacing #if 0 with #if 1 inside the body of the UDiv128() function.
Try it online!
#include <cstdint>
#include <bit>
#include <stdexcept>
#include <string>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if IS_MSVC
#include <intrin.h>
#endif
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#ifdef _DEBUG
#define DASSERT_MSG(cond, msg) ASSERT_MSG(cond, msg)
#else
#define DASSERT_MSG(cond, msg)
#endif
#define DASSERT(cond) DASSERT_MSG(cond, "")
using u16 = uint16_t;
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using UllPtr = unsigned long long *;
inline int GetExp(double x) {
return int((std::bit_cast<uint64_t>(x) >> 52) & 0x7FF) - 1023;
}
inline size_t BitSizeWrong(uint64_t x) {
return x == 0 ? 0 : (GetExp(x) + 1);
}
inline size_t BitSize(u64 x) {
size_t r = 0;
if (x >= (u64(1) << 32)) {
x >>= 32;
r += 32;
}
while (x >= 0x100) {
x >>= 8;
r += 8;
}
while (x) {
x >>= 1;
++r;
}
return r;
}
#if !IS_MSVC
inline u64 UDiv128Asm(u64 h, u64 l, u64 d, u64 * r) {
u64 q;
asm (R"(
.intel_syntax
mov rdx, %V[h]
mov rax, %V[l]
div %V[d]
mov %V[r], rdx
mov %V[q], rax
)"
: [q] "=r" (q), [r] "=r" (*r)
: [h] "r" (h), [l] "r" (l), [d] "r" (d)
: "rax", "rdx"
);
return q;
}
#endif
inline std::pair<u64, u64> UDiv128(u64 hi, u64 lo, u64 d) {
#if IS_MSVC
u64 r, q = _udiv128(hi, lo, d, &r);
return std::make_pair(q, r);
#else
#if 0
using u128 = unsigned __int128;
auto const dnd = (u128(hi) << 64) | lo;
return std::make_pair(u64(dnd / d), u64(dnd % d));
#else
u64 r, q = UDiv128Asm(hi, lo, d, &r);
return std::make_pair(q, r);
#endif
#endif
}
inline std::pair<u64, u64> UMul128(u64 a, u64 b) {
#if IS_MSVC
u64 hi, lo = _umul128(a, b, &hi);
return std::make_pair(hi, lo);
#else
using u128 = unsigned __int128;
auto const x = u128(a) * b;
return std::make_pair(u64(x >> 64), u64(x));
#endif
}
inline std::pair<u64, u64> USub128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_subborrow_u64(_subborrow_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline std::pair<u64, u64> UAdd128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
u64 r_hi, r_lo;
_addcarry_u64(_addcarry_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
return std::make_pair(r_hi, r_lo);
}
inline int UCmp128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
if (a_hi != b_hi)
return a_hi < b_hi ? -1 : 1;
return a_lo == b_lo ? 0 : a_lo < b_lo ? -1 : 1;
}
std::pair<u64, size_t> BarrettRS64(u64 n) {
// https://www.nayuki.io/page/barrett-reduction-algorithm
ASSERT_MSG(n >= 3 && (n & (n - 1)) != 0, "n " + std::to_string(n))
size_t const nbits = BitSize(n);
// 2^s = q * n + r; 2^s = (2^64 + q0) * n + r; 2^s - n * 2^64 = q0 * n + r
u64 const dnd_hi = (nbits >= 64 ? 0ULL : (u64(1) << nbits)) - n;
auto const q0 = UDiv128(dnd_hi, 0, n).first;
return std::make_pair(q0, nbits);
}
template <bool Use128 = true, bool Adjust = true>
std::pair<u64, u64> BarrettDivMod64(u64 x_hi, u64 x_lo, u64 n, u64 r, size_t s) {
// ((x_hi * 2^64 + x_lo) * (2^64 + r)) >> (64 + s)
DASSERT(x_hi < n);
#if !IS_MSVC
if constexpr(Use128) {
using u128 = unsigned __int128;
u128 const xf = (u128(x_hi) << 64) | x_lo;
u64 q = u64((u128(x_hi) * r + xf + u64((u128(x_lo) * r) >> 64)) >> s);
if (s < 64) {
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u128 t = xf - u128(q) * n;
return t < n ? std::make_pair(q, u64(t)) : std::make_pair(q + 1, u64(t) - n);
}
} else
#endif
{
auto const w1a = UMul128(x_lo, r).first;
auto const [w2b, w1b] = UMul128(x_hi, r);
auto const w2c = x_hi, w1c = x_lo;
u64 w1, w2 = _addcarry_u64(0, w1a, w1b, (UllPtr)&w1);
w2 += _addcarry_u64(0, w1, w1c, (UllPtr)&w1);
w2 += w2b + w2c;
if (s < 64) {
u64 q = (w2 << (64 - s)) | (w1 >> s);
u64 t = x_lo - q * n;
if constexpr(Adjust) {
u64 const mask = ~u64(i64(t - n) >> 63);
q += mask & 1;
t -= mask & n;
}
return std::make_pair(q, t);
} else {
u64 const q = w2;
auto const [b_hi, b_lo] = UMul128(q, n);
auto const [t_hi, t_lo] = USub128(x_hi, x_lo, b_hi, b_lo);
return t_hi != 0 || t_lo >= n ? std::make_pair(q + 1, t_lo - n) : std::make_pair(q, t_lo);
}
}
}
#include <random>
#include <iomanip>
#include <iostream>
#include <chrono>
void TestBarrett() {
std::mt19937_64 rng{123}; //{std::random_device{}()};
for (size_t i = 0; i < (1 << 11); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
auto const [br, bs] = BarrettRS64(n);
for (size_t j = 0; j < (1 << 6); ++j) {
u64 const hi = rng() % n, lo = rng();
auto const [ref_q, ref_r] = UDiv128(hi, lo, n);
u64 bar_q = 0, bar_r = 0;
for (size_t k = 0; k < 2; ++k) {
bar_q = 0; bar_r = 0;
if (k == 0)
std::tie(bar_q, bar_r) = BarrettDivMod64<true>(hi, lo, n, br, bs);
else
std::tie(bar_q, bar_r) = BarrettDivMod64<false>(hi, lo, n, br, bs);
ASSERT_MSG(bar_q == ref_q && bar_r == ref_r, "i " + std::to_string(i) + ", j " + std::to_string(j) + ", k " + std::to_string(k) +
", nbits " + std::to_string(nbits) + ", n " + std::to_string(n) + ", bar_q " + std::to_string(bar_q) +
", ref_q " + std::to_string(ref_q) + ", bar_r " + std::to_string(bar_r) + ", ref_r " + std::to_string(ref_r));
}
}
}
}
class bi_uint
{
public:
using u64_t = std::uint64_t;
constexpr static std::size_t u64_bits = 8 * sizeof(u64_t);
//little-endian
std::vector<u64_t> data;
static auto constexpr DefPrep = [](auto n){
return std::make_pair(false, false);
};
static auto constexpr DefDivMod = [](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return UDiv128(dnd_hi, dnd_lo, dsr);
};
//User should guarantee data.size()>0 and val>0
template <typename PrepT = decltype(DefPrep), typename DivModT = decltype(DefDivMod)>
void self_div(const u64_t val, PrepT const & Prep = DefPrep, DivModT const & DivMod = DefDivMod)
{
auto it = data.rbegin();
if(data.size() == 1) {
*it /= val;
return;
}
u64_t rem_hi = 0, rem_lo = 0;
if(*it < val) {
rem_lo = *it++;
//data.pop_back();
}
auto const prep = Prep(val);
u64_t r = rem_lo % val;
u64_t q = 0;
while(it != data.rend()) {
rem_hi = r;
rem_lo = *it;
std::tie(q, r) = DivMod(rem_hi, rem_lo, val, prep);
//*it++ = static_cast<u64_t>(q);
it++;
auto volatile out = static_cast<u64_t>(q);
}
}
};
void TestSpeed() {
auto Time = []{
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
};
std::mt19937_64 rng{123};
std::vector<u64> limbs, divisors;
for (size_t i = 0; i < (1 << 17); ++i)
limbs.push_back(rng());
for (size_t i = 0; i < (1 << 8); ++i) {
size_t const nbits = rng() % 63 + 2;
u64 n = 0;
do {
n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
} while (!(n >= 3 && (n & (n - 1)) != 0));
divisors.push_back(n);
}
std::cout << std::fixed << std::setprecision(3);
double div_time = 0;
{
bi_uint x;
x.data = limbs;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr);
div_time = Time() - tim;
std::cout << "Divide time " << div_time << " sec" << std::endl;
}
{
bi_uint x;
x.data = limbs;
for (size_t i = 0; i < 2; ++i) {
if (IS_MSVC && i == 0)
continue;
auto const tim = Time();
for (auto dsr: divisors)
x.self_div(dsr, [](auto n){ return BarrettRS64(n); },
[i](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
return i == 0 ? BarrettDivMod64<true>(dnd_hi, dnd_lo, dsr, prep.first, prep.second) :
BarrettDivMod64<false>(dnd_hi, dnd_lo, dsr, prep.first, prep.second);
});
double const bar_time = Time() - tim;
std::cout << "Barrett" << (i == 0 ? "128" : "64 ") << " time " << bar_time << " sec, boost " << div_time / bar_time << std::endl;
}
}
}
int main() {
try {
TestBarrett();
TestSpeed();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64 time 0.642 sec, boost 4.937
Part 2
As you have a very interesting question, a few days after first publishing this post I decided to implement all the big-integer math from scratch.
The code below implements the math operations +, -, *, /, <<, >> for natural big numbers (positive integers), and +, -, *, / for floating-point big numbers. Both types of numbers can be of arbitrary size (even millions of bits). Besides the operations you requested, I fully implemented Newton-Raphson division (both the square and cubic variants) and the Goldschmidt fast division algorithm.
Here is the code snippet for just the Newton-Raphson/Goldschmidt functions; the remaining code, as it is very large, is linked below on an external server:
BigFloat & DivNewtonRaphsonSquare(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    thread_local BigFloat two, c_48_17, c_32_17;
    thread_local size_t static_prec = 0;
    if (static_prec != BigFloat::prec_) {
        two = 2;
        c_48_17 = BigFloat(48) / BigFloat(17);
        c_32_17 = BigFloat(32) / BigFloat(17);
        static_prec = BigFloat::prec_;
    }
    BigFloat x = c_48_17 - c_32_17 * b;
    for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
            / std::log2(17.0))) + 0.1; i < num_iters; ++i)
        x = x * (two - b * x);
    *this = a * x;
    return BitNorm();
}

BigFloat & DivNewtonRaphsonCubic(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    thread_local BigFloat one, c_140_33, c_m64_11, c_256_99;
    thread_local size_t static_prec = 0;
    if (static_prec != BigFloat::prec_) {
        one = 1;
        c_140_33 = BigFloat(140) / BigFloat(33);
        c_m64_11 = BigFloat(-64) / BigFloat(11);
        c_256_99 = BigFloat(256) / BigFloat(99);
        static_prec = BigFloat::prec_;
    }
    BigFloat e, y, x = c_140_33 + b * (c_m64_11 + b * c_256_99);
    for (size_t i = 0, num_iters = std::ceil(std::log2(double(static_prec + 1)
            / std::log2(99.0)) / std::log2(3.0)) + 0.1; i < num_iters; ++i) {
        e = one - b * x;
        y = x * e;
        x = x + y + y * e;
    }
    *this = a * x;
    return BitNorm();
}

BigFloat & DivGoldschmidt(BigFloat b) {
    // https://en.wikipedia.org/wiki/Division_algorithm#Goldschmidt_division
    auto a = *this;
    a.exp_ += b.SetScale(0);
    if (b.sign_) {
        a.sign_ = !a.sign_;
        b.sign_ = false;
    }
    BigFloat one = 1, two = 2, f;
    for (size_t i = 0;; ++i) {
        f = two - b;
        a *= f;
        b *= f;
        if (i % 3 == 0 && (one - b).GetScale() < -i64(prec_) + i64(bit_sizeof(Word)))
            break;
    }
    *this = a;
    return BitNorm();
}
See the Output: section below; it shows that the Newton-Raphson and Goldschmidt methods are actually about 10x slower than the regular school-grade algorithm (called Reference in the output). Among themselves, these three advanced algorithms are about the same speed. Raphson/Goldschmidt could probably be faster with a Fast Fourier Transform for multiplication, because multiplying two large numbers takes 95% of the time in these algorithms. In the code below, all results of the Raphson/Goldschmidt algorithms are not only time-measured but also checked for correctness against the school-grade (Reference) algorithm (see diff 2^... in the console output, which shows how large the difference is compared to the school-grade result).
FULL SOURCE CODE HERE. The full code is so huge that it didn't fit into this StackOverflow post due to SO's limit of 30,000 characters per post, although I wrote the code from scratch specifically for this post. That's why I'm providing an external download link (PasteBin); also, the Try it online! link below is the same copy of the code, run live on GodBolt's Linux servers:
Try it online!
Output:
========== 1 K bits ==========
Reference 0.000029 sec
Raphson2 0.000066 sec, boost 0.440x, diff 2^-8192
Raphson3 0.000092 sec, boost 0.317x, diff 2^-8192
Goldschmidt 0.000080 sec, boost 0.365x, diff 2^-1022
========== 2 K bits ==========
Reference 0.000071 sec
Raphson2 0.000177 sec, boost 0.400x, diff 2^-16384
Raphson3 0.000283 sec, boost 0.250x, diff 2^-16384
Goldschmidt 0.000388 sec, boost 0.182x, diff 2^-2046
========== 4 K bits ==========
Reference 0.000319 sec
Raphson2 0.000875 sec, boost 0.365x, diff 2^-4094
Raphson3 0.001122 sec, boost 0.285x, diff 2^-32768
Goldschmidt 0.000881 sec, boost 0.362x, diff 2^-32768
========== 8 K bits ==========
Reference 0.000484 sec
Raphson2 0.002281 sec, boost 0.212x, diff 2^-65536
Raphson3 0.002341 sec, boost 0.207x, diff 2^-65536
Goldschmidt 0.002432 sec, boost 0.199x, diff 2^-8189
========== 16 K bits ==========
Reference 0.001199 sec
Raphson2 0.009042 sec, boost 0.133x, diff 2^-16382
Raphson3 0.009519 sec, boost 0.126x, diff 2^-131072
Goldschmidt 0.009047 sec, boost 0.133x, diff 2^-16380
========== 32 K bits ==========
Reference 0.004311 sec
Raphson2 0.039151 sec, boost 0.110x, diff 2^-32766
Raphson3 0.041058 sec, boost 0.105x, diff 2^-262144
Goldschmidt 0.045517 sec, boost 0.095x, diff 2^-32764
========== 64 K bits ==========
Reference 0.016273 sec
Raphson2 0.165656 sec, boost 0.098x, diff 2^-524288
Raphson3 0.210301 sec, boost 0.077x, diff 2^-65535
Goldschmidt 0.208081 sec, boost 0.078x, diff 2^-65534
========== 128 K bits ==========
Reference 0.059469 sec
Raphson2 0.725865 sec, boost 0.082x, diff 2^-1048576
Raphson3 0.735530 sec, boost 0.081x, diff 2^-1048576
Goldschmidt 0.703991 sec, boost 0.084x, diff 2^-131069
========== 256 K bits ==========
Reference 0.326368 sec
Raphson2 3.007454 sec, boost 0.109x, diff 2^-2097152
Raphson3 2.977631 sec, boost 0.110x, diff 2^-2097152
Goldschmidt 3.363632 sec, boost 0.097x, diff 2^-262141
========== 512 K bits ==========
Reference 1.138663 sec
Raphson2 12.827783 sec, boost 0.089x, diff 2^-524287
Raphson3 13.799401 sec, boost 0.083x, diff 2^-524287
Goldschmidt 15.836072 sec, boost 0.072x, diff 2^-524286
On most modern CPUs, division is indeed much slower than multiplication.
Referring to
https://agner.org/optimize/instruction_tables.pdf
on Intel Skylake, MUL/IMUL has a latency of 3-4 cycles, while DIV/IDIV can take 26-90 cycles, which is 7-23 times slower than MUL; so your initial benchmark result isn't really a surprise.
If you happen to be on an x86 CPU and this is indeed the bottleneck, you could try to utilize AVX/SSE instructions, as shown in the answer below. Basically, you'd need to rely on specialized instructions rather than a general one like DIV/IDIV.
How to divide a __m256i vector by an integer variable?

Karatsuba Integer Multiplication failing with segmentation fault

When I run the program, it crashes with a segmentation fault. Also, when I try to debug the code in the Code::Blocks IDE, I am unable to: the program crashes even before debugging begins. I am not able to understand the problem. Any help would be appreciated. Thanks!!
#include <iostream>
#include <math.h>
#include <string>
using namespace std;

// Method to make strings of equal length
int makeEqualLength(string& fnum, string& snum) {
    int l1 = fnum.length();
    int l2 = snum.length();
    if (l1 > l2) {
        int d = l1 - l2;
        while (d > 0) {
            snum = '0' + snum;
            d--;
        }
        return l1;
    }
    else if (l2 > l1) {
        int d = l2 - l1;
        while (d > 0) {
            fnum = '0' + fnum;
            d--;
        }
        return l2;
    }
    else
        return l1;
}

int singleDigitMultiplication(string& fnum, string& snum) {
    return ((fnum[0] - '0') * (snum[0] - '0'));
}

string addStrings(string& s1, string& s2) {
    int length = makeEqualLength(s1, s2);
    int carry = 0;
    string result;
    for (int i = length - 1; i >= 0; i--) {
        int fd = s1[i] - '0';
        int sd = s2[i] - '0';
        int sum = (fd + sd + carry) % 10 + '0';
        carry = (fd + sd + carry) / 10;
        result = (char)sum + result;
    }
    result = (char)carry + result;
    return result;
}

long int multiplyByKaratsubaMethod(string fnum, string snum) {
    int length = makeEqualLength(fnum, snum);
    if (length == 0) return 0;
    if (length == 1) return singleDigitMultiplication(fnum, snum);
    int fh = length / 2;
    int sh = length - fh;
    string Xl = fnum.substr(0, fh);
    string Xr = fnum.substr(fh, sh);
    string Yl = snum.substr(0, fh);
    string Yr = snum.substr(fh, sh);
    long int P1 = multiplyByKaratsubaMethod(Xl, Yl);
    long int P3 = multiplyByKaratsubaMethod(Xr, Yr);
    long int P2 = multiplyByKaratsubaMethod(addStrings(Xl, Xr), addStrings(Yl, Yr)) - P1 - P3;
    return (P1 * pow(10, length) + P2 * pow(10, length / 2) + P3);
}

int main()
{
    string firstNum = "62";
    string secondNum = "465";
    long int result = multiplyByKaratsubaMethod(firstNum, secondNum);
    cout << result << endl;
    return 0;
}
There are three serious issues in your code:
result = (char)carry + result; does not work. The carry has a value between 0 (0 * 0) and 8 (9 * 9). It has to be converted to the corresponding ASCII value: result = (char)(carry + '0') + result;.
This leads to the next issue: the carry is inserted even if it is 0. An if statement is missing: if (carry /* != 0 */) result = (char)(carry + '0') + result;.
After fixing the first two issues and testing again, the stack overflow still occurred. So I compared your algorithm with another one I found via Google: Divide and Conquer | Set 4 (Karatsuba algorithm for fast multiplication) (which possibly was your source, because it looks very similar). Without digging deeper, I fixed what looked like a simple transcription mistake: return P1 * pow(10, 2 * sh) + P2 * pow(10, sh) + P3; (I replaced length by 2 * sh and length/2 by sh, as in the googled code.) This became obvious to me after seeing in the debugger that length can have odd values, so that sh and length/2 are distinct values.
Afterwards, your program worked.
I changed the main() function to test it a little bit harder:
#include <cmath>
#include <iostream>
#include <string>
using namespace std;

string intToStr(int i)
{
    string text;
    do {
        text.insert(0, 1, i % 10 + '0');
        i /= 10;
    } while (i);
    return text;
}

// Method to make strings of equal length
int makeEqualLength(string &fnum, string &snum)
{
    int l1 = (int)fnum.length();
    int l2 = (int)snum.length();
    return l1 < l2
        ? (fnum.insert(0, l2 - l1, '0'), l2)
        : (snum.insert(0, l1 - l2, '0'), l1);
}

int singleDigitMultiplication(const string& fnum, const string& snum)
{
    return ((fnum[0] - '0') * (snum[0] - '0'));
}

string addStrings(string& s1, string& s2)
{
    int length = makeEqualLength(s1, s2);
    int carry = 0;
    string result;
    for (int i = length - 1; i >= 0; --i) {
        int fd = s1[i] - '0';
        int sd = s2[i] - '0';
        int sum = (fd + sd + carry) % 10 + '0';
        carry = (fd + sd + carry) / 10;
        result.insert(0, 1, (char)sum);
    }
    if (carry) result.insert(0, 1, (char)(carry + '0'));
    return result;
}

long int multiplyByKaratsubaMethod(string fnum, string snum)
{
    int length = makeEqualLength(fnum, snum);
    if (length == 0) return 0;
    if (length == 1) return singleDigitMultiplication(fnum, snum);
    int fh = length / 2;
    int sh = length - fh;
    string Xl = fnum.substr(0, fh);
    string Xr = fnum.substr(fh, sh);
    string Yl = snum.substr(0, fh);
    string Yr = snum.substr(fh, sh);
    long int P1 = multiplyByKaratsubaMethod(Xl, Yl);
    long int P3 = multiplyByKaratsubaMethod(Xr, Yr);
    long int P2
        = multiplyByKaratsubaMethod(addStrings(Xl, Xr), addStrings(Yl, Yr))
        - P1 - P3;
    return P1 * pow(10, 2 * sh) + P2 * pow(10, sh) + P3;
}

int main()
{
    int nErrors = 0;
    for (int i = 0; i < 1000; i += 3) {
        for (int j = 0; j < 1000; j += 3) {
            long int result
                = multiplyByKaratsubaMethod(intToStr(i), intToStr(j));
            bool ok = result == i * j;
            cout << i << " * " << j << " = " << result
                 << (ok ? " OK." : " ERROR!") << endl;
            nErrors += !ok;
        }
    }
    cout << nErrors << " error(s)." << endl;
    return 0;
}
Notes about changes I've made:
Concerning the std library: please don't mix headers with ".h" and without. Every header of the std library is available in a "non-suffix flavor". (The headers with ".h" are either C headers or old-fashioned.) The headers of the C library have been adapted to C++: they keep the old name, with a "c" prefix and without the ".h" suffix.
Thus, I replaced #include <math.h> with #include <cmath>.
I couldn't resist making makeEqualLength() a little bit shorter.
Please note that a lot of methods in std use std::size_t instead of int or unsigned. std::size_t has the appropriate width for array subscripts and pointer arithmetic, i.e., it has "machine word width". I believed for a long time that int and unsigned should have "machine word width" as well and didn't care about size_t. When we changed in Visual Studio from x86 (32 bits) to x64 (64 bits), I learned the hard way that I had been very wrong: std::size_t is 64 bits now, but int and unsigned are still 32 bits. (MS VC++ is no exception; other compiler vendors, but not all, do it the same way.) I inserted some C-style casts to remove the warnings from the compiler output. Such casts (regardless of whether you use C casts or, better, the C++ casts) should always be used with care and understood as a confirmation: "Dear compiler, I see you have concerns, but I (believe to) know and assure you that it should work fine."
I'm not sure about your intention to use long int in some places. (Probably, you transferred this from the original source without caring about it.) As you surely know, the actual sizes of the int types may differ to best match the target platform. I'm working on an Intel PC with Windows 10, using Visual Studio, where sizeof (int) == sizeof (long int) (32 bits), whether I compile x86 (32-bit) or x64 (64-bit) code; the same held for gcc (on cygwin in my case). On 64-bit Linux, however, long int is 64 bits (the LP64 data model), so the width of long int is not portable. For a type guaranteed to be larger than int, you have to choose long long int.
I did the sample session in cygwin on Windows 10 (64 bit):
$ g++ -std=c++11 -o karatsuba karatsuba.cc
$ ./karatsuba
0 * 0 = 0 OK.
0 * 3 = 0 OK.
0 * 6 = 0 OK.
etc. etc.
999 * 993 = 992007 OK.
999 * 996 = 995004 OK.
999 * 999 = 998001 OK.
0 error(s).
$

C++ vectorization of conditional code with intrinsics

I tried to enable vectorization of an often-used function to improve the performance.
The algorithm should do the following and is called ~4,000,000 times!
Input: double* cellvalue
Output: int8* output (an 8-bit integer, i.e. a C++ char)
Algo:
if (cellvalue > upper_threshold )
*output = 1;
else if (cellvalue < lower_threshold)
*output = -1;
else
*output = 0;
My first vectorization approach to compute 2 doubles in parallel looks like:
__m128d lowerThresh = _mm_set1_pd(m_lowerThreshold);
__m128d upperThresh = _mm_set1_pd(m_upperThreshold);
__m128d vec = _mm_load_pd(cellvalue);
__m128d maskLower = _mm_cmplt_pd(vec, lowerThresh); // less than
__m128d maskUpper = _mm_cmpgt_pd(vec, upperThresh); // greater than
static const tInt8 negOne = -1;
static const tInt8 posOne = 1;
output[0] = (negOne & *((tInt8*)&maskLower.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper.m128d_f64[0]));
output[1] = (negOne & *((tInt8*)&maskLower.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper.m128d_f64[1]));
Does this make sense to you? It works, but I think the last part to create the output is very complicated. Is there any faster method to do this?
Also I tried to compute 8 values at once with nearly the same code. Will this perform better? Does the order of instructions make sense?
__m128d lowerThresh = _mm_set1_pd(m_lowerThreshold);
__m128d upperThresh = _mm_set1_pd(m_upperThreshold);
// load 4 times
__m128d vec0 = _mm_load_pd(cellValue);
__m128d vec1 = _mm_load_pd(cellValue + 2);
__m128d vec2 = _mm_load_pd(cellValue + 4);
__m128d vec3 = _mm_load_pd(cellValue + 6);
__m128d maskLower0 = _mm_cmplt_pd(vec0, lowerThresh); // less than
__m128d maskLower1 = _mm_cmplt_pd(vec1, lowerThresh); // less than
__m128d maskLower2 = _mm_cmplt_pd(vec2, lowerThresh); // less than
__m128d maskLower3 = _mm_cmplt_pd(vec3, lowerThresh); // less than
__m128d maskUpper0 = _mm_cmpgt_pd(vec0, upperThresh); // greater than
__m128d maskUpper1 = _mm_cmpgt_pd(vec1, upperThresh); // greater than
__m128d maskUpper2 = _mm_cmpgt_pd(vec2, upperThresh); // greater than
__m128d maskUpper3 = _mm_cmpgt_pd(vec3, upperThresh); // greater than
static const tInt8 negOne = -1;
static const tInt8 posOne = 1;
output[0] = (negOne & *((tInt8*)&maskLower0.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper0.m128d_f64[0]));
output[1] = (negOne & *((tInt8*)&maskLower0.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper0.m128d_f64[1]));
output[2] = (negOne & *((tInt8*)&maskLower1.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper1.m128d_f64[0]));
output[3] = (negOne & *((tInt8*)&maskLower1.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper1.m128d_f64[1]));
output[4] = (negOne & *((tInt8*)&maskLower2.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper2.m128d_f64[0]));
output[5] = (negOne & *((tInt8*)&maskLower2.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper2.m128d_f64[1]));
output[6] = (negOne & *((tInt8*)&maskLower3.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper3.m128d_f64[0]));
output[7] = (negOne & *((tInt8*)&maskLower3.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper3.m128d_f64[1]));
Hopefully you can help me to understand the vectorization thing a bit better ;)
_mm_cmplt_pd and _mm_cmpgt_pd produce a result that is already either 0 or -1; ANDing it with -1 does nothing, and ANDing it with 1 is equivalent to negating it. Thus, if upper_threshold > lower_threshold (so that the two conditions are never both true), you can just write*:
_mm_storeu_si128((__m128i*)output, _mm_sub_epi64(_mm_castpd_si128(maskLower), _mm_castpd_si128(maskUpper)));
(*) it's unclear what an "int8" is in your code; that's not a standard type in C++. It could be an 8-byte int, which is the behavior I've used here. If it's an 8-bit int instead, you'll want to pack up a bunch of results to store together.
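To see why the subtraction works, here is a scalar analogue (my own illustration, not from the question): each mask is all-ones, i.e. -1 as a signed integer, when its compare is true, and 0 otherwise, so lower-mask minus upper-mask lands exactly on -1, 0, or +1:

```cpp
#include <cstdint>

// Scalar analogue of _mm_sub_epi64(maskLower, maskUpper).
static int64_t classify(double x, double lo, double hi) {
    int64_t maskLower = (x < lo) ? -1 : 0;  // all-ones when below the range
    int64_t maskUpper = (x > hi) ? -1 : 0;  // all-ones when above the range
    return maskLower - maskUpper;           // -1, 0, or +1
}
```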
Questioner clarifies that they intend int8 to be an 8-bit integer. In that case, you can do the following for a quick implementation:
__m128i result = _mm_sub_epi64(_mm_castpd_si128(maskLower), _mm_castpd_si128(maskUpper));
output[0] = result.m128i_i64[0]; // .m128i_i64 is an oddball MSVC-ism, so
output[1] = result.m128i_i64[1]; // I'm not 100% sure about the syntax here.
but you may also want to try packing eight result vectors together and store them with a single store operation.
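Here is one way such a packing step could look (a sketch of my own, assuming SSE2 and x86's little-endian lane layout, not code from the question): narrow each 64-bit result lane to 32 bits with a shuffle, then saturate-pack 32 → 16 → 8 bits and store all eight bytes at once.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Classify 8 doubles against [lo, hi] and emit 8 int8 results (-1/0/+1).
static void compare8(const double* x, double lo, double hi, int8_t* out) {
    const __m128d vlo = _mm_set1_pd(lo), vhi = _mm_set1_pd(hi);
    __m128i r[4];
    for (int i = 0; i < 4; ++i) {
        __m128d v = _mm_loadu_pd(x + 2 * i);
        __m128i mLo = _mm_castpd_si128(_mm_cmplt_pd(v, vlo));
        __m128i mHi = _mm_castpd_si128(_mm_cmpgt_pd(v, vhi));
        r[i] = _mm_sub_epi64(mLo, mHi);      // each 64-bit lane: -1, 0 or +1
    }
    // Keep the low dword of every 64-bit lane (the values fit in 32 bits).
    __m128i t01 = _mm_castps_si128(_mm_shuffle_ps(
        _mm_castsi128_ps(r[0]), _mm_castsi128_ps(r[1]), _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i t23 = _mm_castps_si128(_mm_shuffle_ps(
        _mm_castsi128_ps(r[2]), _mm_castsi128_ps(r[3]), _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i p16 = _mm_packs_epi32(t01, t23); // 8 x int16 (signed saturation)
    __m128i p8  = _mm_packs_epi16(p16, p16); // 16 x int8, low 8 are valid
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), p8);
}
```

This replaces sixteen scalar stores with one 8-byte store per group of eight inputs; whether it wins in practice is something you'd have to measure.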
If you change the code not to branch, then a modern compiler will do the vectorization for you.
Here's the test I ran:
#include <stdint.h>
#include <iostream>
#include <random>
#include <vector>
#include <chrono>
using Clock = std::chrono::steady_clock;
using std::chrono::milliseconds;
typedef double Scalar;
typedef int8_t Integer;
const Scalar kUpperThreshold = .5;
const Scalar kLowerThreshold = .2;
void compute_comparisons1(int n, const Scalar* xs, Integer* ys) {
#pragma simd
for (int i=0; i<n; ++i) {
Scalar x = xs[i];
ys[i] = (x > kUpperThreshold) - (x < kLowerThreshold);
}
}
void compute_comparisons2(int n, const Scalar* xs, Integer* ys) {
for (int i=0; i<n; ++i) {
Scalar x = xs[i];
Integer& y = ys[i];
if (x > kUpperThreshold)
y = 1;
else if(x < kLowerThreshold)
y = -1;
else
y = 0;
}
}
const int N = 4000000;
auto random_generator = std::mt19937{0};
int main() {
std::vector<Scalar> xs(N);
std::vector<Integer> ys1(N);
std::vector<Integer> ys2(N);
std::uniform_real_distribution<Scalar> dist(0, 1);
for (int i=0; i<N; ++i)
xs[i] = dist(random_generator);
auto time0 = Clock::now();
compute_comparisons1(N, xs.data(), ys1.data());
auto time1 = Clock::now();
compute_comparisons2(N, xs.data(), ys2.data());
auto time2 = Clock::now();
std::cout << "v1: " << std::chrono::duration_cast<milliseconds>(time1 - time0).count() << "\n";
std::cout << "v2: " << std::chrono::duration_cast<milliseconds>(time2 - time1).count() << "\n";
for (int i=0; i<N; ++i) {
if (ys1[i] != ys2[i]) {
std::cout << "Error!\n";
return -1;
}
}
return 0;
}
If you compile with a recent version of gcc (I used 4.8.3) and use the flags "-O3 -std=c++11 -march=native -S", you can verify by looking at the assembly that it vectorizes the code. And it runs much faster (3 milliseconds vs 16 milliseconds on my machine.)
Also, I'm not sure what your requirements are, but if you can live with less precision, then using float instead of double will further improve the speed (double takes 1.8x as long on my machine).

Is there any code Optimization method for the following c++ program

BYTE * srcData;
BYTE * pData;
int i,j;
int srcPadding;
//some variable initialization
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
for (int col = 0;col < w;col++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
}
}
I've tried loop unrolling, but it helps little.
int segs = w / 4;
int remain = w - segs * 4;
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
int idx = 0;
for (idx = 0;idx < segs;idx++,pData += 16,srcData += 12)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
memcpy(pData + 4,srcData + 3,3);
*(pData + 7) = 0xFF;
memcpy(pData + 8,srcData + 6,3);
*(pData + 11) = 0xFF;
memcpy(pData + 12,srcData + 9,3);
*(pData + 15) = 0xFF;
}
for (idx = 0;idx < remain;idx++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
}
}
Depending on your compiler, you may not want memcpy at all for such a small copy. Here is a variant version for the body of your unrolled loop (note it assumes little-endian byte order and a platform that permits unaligned 32-bit loads); see if it's faster:
uint32_t in0 = *(uint32_t*)(srcData);
uint32_t in1 = *(uint32_t*)(srcData + 4);
uint32_t in2 = *(uint32_t*)(srcData + 8);
uint32_t out0 = UINT32_C(0xFF000000) | (in0 & UINT32_C(0x00FFFFFF));
uint32_t out1 = UINT32_C(0xFF000000) | (in0 >> 24) | ((in1 & 0xFFFF) << 8);
uint32_t out2 = UINT32_C(0xFF000000) | (in1 >> 16) | ((in2 & 0xFF) << 16);
uint32_t out3 = UINT32_C(0xFF000000) | (in2 >> 8);
*(uint32_t*)(pData) = out0;
*(uint32_t*)(pData + 4) = out1;
*(uint32_t*)(pData + 8) = out2;
*(uint32_t*)(pData + 12) = out3;
You should also declare srcData and pData as BYTE * __restrict pointers (restrict is a C99 keyword; most C++ compilers spell it __restrict) so the compiler will know they don't alias.
I don't see much that you're doing that isn't necessary. You could change the post-increments to pre-increments (idx++ to ++idx, for instance), but that won't have a measurable effect.
Additionally, you could use std::copy instead of memcpy. std::copy has more information available to it and in theory can pick the most efficient way to copy things. Unfortunately I don't believe that many STL implementations actually take advantage of the extra information.
The only thing that I expect would make a difference is that there's no reason to wait for one memcpy to finish before starting the next. You could use OpenMP or Intel Threading Building Blocks (or a thread queue of some kind) to parallelize the loops.
Don't call memcpy, just do the copy by hand. The function call overhead isn't worth it unless you can copy more than 3 bytes at a time.
As far as this particular loop goes, you may want to look at a technique called Duff's device, which is a loop-unrolling technique that takes advantage of the switch construct.
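A minimal sketch of the idea (the helper name is mine): the switch jumps into the middle of the unrolled do/while, so the remainder iterations are handled on the first pass without a separate cleanup loop.

```cpp
#include <cstddef>

// Duff's device: copy n bytes, unrolled eight-fold.
static void duff_copy(unsigned char* dst, const unsigned char* src, std::size_t n) {
    if (n == 0) return;
    std::size_t rounds = (n + 7) / 8;   // total passes through the loop body
    switch (n % 8) {                    // jump into the body for the remainder
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}
```

Be aware that modern compilers often generate better code from a plain loop or memcpy than from Duff's device, so measure before adopting it.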
Maybe changing to a while loop instead of nested for loops:
BYTE *src = srcData;
BYTE *dest = pData;
const BYTE *maxsrc = srcData + h * (w * 3 + srcPadding);
int offset = 0;
const int maxoffset = w * 3;
while (src + offset < maxsrc) {
*dest++ = src[offset++];
*dest++ = src[offset++];
*dest++ = src[offset++];
*dest++ = 0xFF; // fill the fourth byte
if (offset >= maxoffset) { // end of row: skip the padding
src += maxoffset + srcPadding;
offset = 0;
}
}