Fastest way to lookup bytes from 64-byte arrays

I have a performance-critical spot in my code: I need to look up values from a 64-byte array about 1'000'000 times.
Minimal code:
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64
int main(){
const int n_indices = 1000000;
TYPE lookup[n_lookup];
TYPE indices[n_indices];
TYPE result[n_indices];
// preparations
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(0, n_lookup);
for (int i=0; i < n_indices; i++) indices[i] = distribution(generator);
for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
// main loop:
for (int i=0; i < n_indices; i++) {
result[i] = lookup[indices[i]];
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element"<< std::endl;
// printing random numbers to avoid code elimination
std::cout << result[12] << result[45];
return 0;
}
After compiling with g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops I get a bit less than 1 ns per element on a modern CPU.
I need this operation to run 2-3 times faster (without threads). How can I do this?
P.S. I also investigated the AVX512 instruction set (512 bits is exactly the size of the lookup table!), but it lacks 8-bit gather operations!

The indices and result arrays are in different places in memory but are accessed at the same time, which leads to cache misses. I suggest merging result and indices into one array. Here is the code:
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64
int main(){
const int n_indices = 2000000;
TYPE lookup[n_lookup];
// Merge indices and result
// If i is index, then i+1 is result
TYPE ind_res[n_indices];
// preparations
std::default_random_engine generator;
// bounds are inclusive, so the upper bound must be n_lookup - 1 to stay inside the array
std::uniform_int_distribution<int> distribution(0, n_lookup - 1);
for (int i=0; i < n_indices; i += 2) ind_res[i] = distribution(generator);
for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
// main loop:
for (int i=0; i < n_indices; i += 2) {
ind_res[i+1] = lookup[ind_res[i]]; // more dense access here, no cache-miss
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
// only n_indices / 2 lookups are performed, so divide by that count
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / (n_indices / 2) << " ns per element"<< std::endl;
// printing random numbers to avoid code elimination
std::cout << ind_res[24] << ind_res[90];
return 0;
}
My tests show that this code runs much faster.

With -march=native, this is what your loop compiles to:
movq %rax, %rbx
xorl %eax, %eax
.L145:
movzbl 128(%rsp,%rax), %edx
movzbl 64(%rsp,%rdx), %edx
movb %dl, 1000128(%rsp,%rax)
addq $1, %rax
cmpq $1000000, %rax
jne .L145
I'm struggling to see how that gets any quicker without parallelisation.
By changing TYPE to int32_t, it gets vectorised:
vpcmpeqd %ymm2, %ymm2, %ymm2
movq %rax, %rbx
xorl %eax, %eax
.L145:
vmovdqa -8000048(%rbp,%rax), %ymm1
vmovdqa %ymm2, %ymm3
vpgatherdd %ymm3, -8000304(%rbp,%ymm1,4), %ymm0
vmovdqa %ymm0, -4000048(%rbp,%rax)
addq $32, %rax
cmpq $4000000, %rax
jne .L145
vzeroupper
Might that help?

First, there is a bug: distribution(0, 64) produces numbers from 0 to 64 inclusive, and index 64 is past the end of the array.
You can speed up the lookup 2x by looking up two values at a time:
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define TYPE2 uint16_t
#define n_lookup 64
void tst() {
const int n_indices = 1000000;// has to be multiple of 2
TYPE lookup[n_lookup];
TYPE indices[n_indices];
TYPE result[n_indices];
TYPE2 lookup2[n_lookup * 256];
// preparations
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(0, n_lookup-1);
for (int i = 0; i < n_indices; i++) indices[i] = distribution(generator);
for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);
for (int i = 0; i < n_lookup; ++i) {
for (int j = 0; j < n_lookup; ++j) {
lookup2[(i << 8) | j] = (lookup[i] << 8) | lookup[j];
}
}
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
TYPE2* indices2 = (TYPE2*)indices;
TYPE2* result2 = (TYPE2*)result;
// main loop:
for (int i = 0; i < n_indices / 2; ++i) {
*result2++ = lookup2[*indices2++];
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
for (int i = 0; i < n_indices; i++) {
if (result[i] != lookup[indices[i]]) {
std::cout << "!!!!!!!!!!!!!ERROR!!!!!!!!!!!!!";
}
}
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
// printing random numbers to avoid code elimination
std::cout << result[12] << result[45];
}
int main() {
tst();
std::cin.get();
return 0;
}
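The casts from TYPE* to TYPE2* above read uint8_t storage through a uint16_t pointer, which steps outside the strict-aliasing rules. Here is a minimal sketch of the same pair-lookup idea using std::memcpy for the 16-bit loads and stores instead. The names build_pair_table and lookup_pairs are made up for illustration, and whether it matches the speed of the pointer-cast version depends on the compiler, so it should be measured rather than assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: builds a 64 * 256 entry table indexed by two packed
// index bytes, mirroring lookup2 from the answer above.
std::vector<std::uint16_t> build_pair_table(const std::uint8_t (&lookup)[64]) {
    std::vector<std::uint16_t> table(64 * 256, 0);
    for (int i = 0; i < 64; ++i) {
        for (int j = 0; j < 64; ++j) {
            table[(i << 8) | j] =
                static_cast<std::uint16_t>((lookup[i] << 8) | lookup[j]);
        }
    }
    return table;
}

// Hypothetical helper: translates n index bytes (each must be < 64),
// two at a time.
void lookup_pairs(const std::vector<std::uint16_t>& table,
                  const std::uint8_t* indices, std::uint8_t* result,
                  std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        std::uint16_t packed;
        std::memcpy(&packed, indices + i, sizeof packed);  // aliasing-safe load
        const std::uint16_t out = table[packed];
        std::memcpy(result + i, &out, sizeof out);         // aliasing-safe store
    }
    // An odd trailing element uses the low byte of entry (0 << 8) | j,
    // which holds lookup[j].
    if (i < n) {
        result[i] = static_cast<std::uint8_t>(table[indices[i]] & 0xFF);
    }
}
```

Because the table packs both bytes symmetrically, the byte order of the machine does not matter: result[k] ends up equal to lookup[indices[k]] on little-endian and big-endian targets alike.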

Your code is already really fast. However,
on my system the execution is about 4.858% faster when you change
const int n_indices = 1000000;
to
const int n_indices = 1048576; // 2^20
This is not much, but it's something.

Related

Why is std::count slower than a plain for loop with the MSVC compiler but equal with GCC?

I am testing the performance of the C++ standard library algorithms and encountered a weird thing.
Here is my code to compare the performance of std::count vs plain for loop:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int my_count(const std::vector<int>& v, int val) {
int num = 0;
for (int i: v) {
if (i == val)
num++;
}
return num;
}
int main()
{
int total_count = 0;
std::vector<int> v;
v.resize(100000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
{
auto start = high_resolution_clock::now();
total_count += std::count(v.begin(), v.end(), val);
auto stop = high_resolution_clock::now();
std::cout << "std::count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
{
auto start = high_resolution_clock::now();
total_count += my_count(v, val);
auto stop = high_resolution_clock::now();
std::cout << "my_count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
// We need this so the compiler does not prune the code above
std::cout << "Total items: " << total_count << std::endl;
}
With MinGW I get this:
std::count time: 65827
my_count time: 64861
And with MSVC I get a pretty weird result:
std::count time: 65532
my_count time: 28584
The MinGW result seems reasonable since, as far as I know, the STL count function is roughly equivalent to the plain for loop, but the MSVC result seems weird: why is the plain for loop more than 2x faster than std::count?
These results are reproducible on my machine - it's not something that occurs once, but it occurs each time I run the code. I even tried changing the function order, running multiple for loops to avoid caching or branch prediction bias, but I still get the same result.
Is there any reason for this?

This is because MSVC vectorizes your manually written code, but is unable to do the same for std::count.
This is how vectorized code looks:
movdqa xmm5, XMMWORD PTR __xmm@00000001000000010000000100000001
and rcx, -8
xorps xmm3, xmm3
xorps xmm2, xmm2
npad 3
$LL4@my_count:
movdqu xmm1, XMMWORD PTR [rax]
add r8, 8
movdqa xmm0, xmm5
paddd xmm0, xmm3
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm3
movdqa xmm3, xmm0
movdqa xmm0, xmm5
por xmm3, xmm1
paddd xmm0, xmm2
movdqu xmm1, XMMWORD PTR [rax+16]
add rax, 32 ; 00000020H
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm2
movdqa xmm2, xmm0
por xmm2, xmm1
cmp r8, rcx
jne SHORT $LL4@my_count
You can see how it loads 4 ones into the xmm5 register at the beginning. This value is used to maintain 4 separate counters that track the result for the 1st, 2nd, 3rd and 4th DWORDs. Once counting is done, those 4 values are added together to form the result of the function.
The issue with the MSVC vectorizer seems to lie in the fact that the counter, data type and argument type must be "compatible":
The return type should match the data type in size
The argument type should be equal to or smaller than the data type in size
If any of those constraints is not met, the code is not vectorized. This makes sense: if your data type is 32 bits wide, you have to operate on 32-bit counters to make them work together, so if your return type is 64 bits wide instead, some additional manipulation is required (which is what GCC is able to do, but this still slows down std::count compared to the manually written loop).
This is a case where the manually written loop should be preferred, as subtle differences in semantics (int return type) make it easier to vectorize (even for GCC, which generates shorter code).
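To make the type constraint concrete: std::count returns the iterator's difference_type, which on common 64-bit implementations is wider than the 32-bit int elements being compared. The sketch below only illustrates that mismatch next to a hand-written loop whose counter has the same width as the data. The name count_int is hypothetical, and nothing here is a claim about what a particular MSVC version will or will not vectorize.

```cpp
#include <algorithm>
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

using VecIt = std::vector<int>::const_iterator;

// The counter type std::count accumulates into and returns.
using StdCountResult = std::iterator_traits<VecIt>::difference_type;

static_assert(
    std::is_same<decltype(std::count(std::declval<VecIt>(),
                                     std::declval<VecIt>(), 0)),
                 StdCountResult>::value,
    "std::count returns the iterator's difference_type");

// Hand-written loop: the counter is an int, the same width as the elements.
int count_int(const std::vector<int>& v, int val) {
    int num = 0;
    for (int x : v) {
        if (x == val) {
            ++num;
        }
    }
    return num;
}
```

Both functions return the same count; they differ only in the width of the accumulator, which is the property the constraints above are about.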

Well, that seems to be an iterator issue.
I've made an extended test:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int std_count(const std::vector<int>& v, int val) {
return std::count(v.begin(), v.end(), val);
}
int my_count_for(const std::vector<int>& v, int val) {
int num = 0;
for (int i = 0; i < v.size(); i++) {
if (v[i] == val) {
num++;
}
}
return num;
}
int my_count_for_in(const std::vector<int>& v, int val) {
int num = 0;
for (int i : v) {
if (i == val) {
num++;
}
}
return num;
}
int my_count_iter(const std::vector<int>& v, int val) {
int num = 0;
for (auto i = v.begin(); i < v.end(); i++) {
if (*i == val) {
num++;
}
}
return num;
}
int main()
{
std::vector<int> v;
v.resize(1000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
int num_iters = 1000;
int total_count = 0;
for (int a = 0; a < 3; a++) {
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += std_count(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "std::count time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for_in(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for_in time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_iter(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_iter time: " << duration.count() << std::endl;
}
std::cout << std::endl;
}
std::cout << total_count << std::endl;
std::cin >> total_count;
}
And here's what I get:
std::count time: 679683
my_count_for time: 235269
my_count_for_in time: 228185
my_count_iter time: 650714
std::count time: 656192
my_count_for time: 231248
my_count_for_in time: 231050
my_count_iter time: 652598
std::count time: 660295
my_count_for time: 238812
my_count_for_in time: 225893
my_count_iter time: 648812
Still seems quite weird that STL function is not the fastest way to solve the task. If someone knows the detailed answer, please share it with me.

Rounding integers routine

There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.
The seemingly preferred method is by casting the divisor into a float, then rounding the float to the nearest whole number, then cast that back into integer:
#include <cmath>
int round_divide_by_float_casting(int a, int b){
return (int) std::roundf( a / (float) b);
}
Yet this seems like scratching your left ear with your right hand. I use:
int round_divide (int a, int b){
return a / b + a % b * 2 / b;
}
It's no breakthrough, but the fact that it is not standard makes me wonder if I am missing anything?
Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give me different results. Did someone run into some sort of scenario where the int → float → int casting produced more accurate results?

Arithmetic solution
If one had to define what your functions should return, one would describe it as something close to "f(a, b) returns the closest integer to the result of the division of a by b in the real divisor ring."
Thus, the question can be summarized as: can we define this closest integer using only integer division? I think we can.
There are exactly two candidates for the closest integer: a / b and (a / b) + 1 (1). The selection is easy: if a % b is closer to 0 than it is to b, then a / b is our result. If not, (a / b) + 1 is.
One could then write something similar to, ignoring optimization and good practices:
int divide(int a, int b)
{
const int quot = a / b;
const int rem = a % b;
int result;
if (rem < b - rem) {
result = quot;
} else {
result = quot + 1;
}
return result;
}
While this definition satisfies our needs, one could optimize it by not computing the division of a by b twice, using std::div():
int divide(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.
(1) There is just one last thing to check: how does it behave when a or b is negative? This is left to the reader ;).
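For readers who want to see one way the negative case could be handled, here is a minimal sketch. It is not part of the original answer: the name round_divide_signed is hypothetical, and it assumes the intended behaviour is "round to nearest, ties away from zero" (what roundf does), that b != 0, that long long is wider than int, and that a / b itself does not overflow (a == INT_MIN with b == -1).

```cpp
#include <cstdlib>

// Sketch: rounds a / b to the nearest integer, ties away from zero.
// Assumes b != 0 and that a / b does not overflow.
int round_divide_signed(int a, int b) {
    const std::div_t dv = std::div(a, b);  // quotient truncates toward zero
    // Work with magnitudes in long long so that |b| cannot overflow.
    const long long rem_abs = std::llabs(static_cast<long long>(dv.rem));
    const long long b_abs = std::llabs(static_cast<long long>(b));
    if (rem_abs >= b_abs - rem_abs) {
        // The exact result is at least half a step past the truncated
        // quotient, so move one step further away from zero.
        const bool negative = (a < 0) != (b < 0);
        return dv.quot + (negative ? -1 : 1);
    }
    return dv.quot;
}
```

With these assumptions, round_divide_signed(7, 2) gives 4 and round_divide_signed(-7, 2) gives -4, matching what roundf produces for 3.5 and -3.5.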
Benchmark
#include <iostream>
#include <iomanip>
#include <string>
// solutions
#include <cmath>
#include <cstdlib>
// benchmark
#include <array>
#include <ctime>
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>
//
// Solutions
//
namespace
{
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem)
{
++result;
}
return result;
}
}
//
// benchmark
//
class Randomizer
{
std::mt19937 _rng_engine;
std::uniform_int_distribution<int> _distri;
public:
Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
{
}
template<class ForwardIt>
void operator()(ForwardIt begin, ForwardIt end)
{
std::generate(begin, end, std::bind(_distri, _rng_engine));
}
};
class Clock
{
std::chrono::time_point<std::chrono::steady_clock> _start;
public:
static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }
Clock() : _start(now())
{
}
template<class DurationUnit>
std::size_t end()
{
return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
}
};
//
// Entry point
//
int main()
{
Randomizer randomizer;
std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
std::array<int, dividends.size()> divisors;
std::array<int, dividends.size()> results;
randomizer(std::begin(dividends), std::end(dividends));
randomizer(std::begin(divisors), std::end(divisors));
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_float_casting(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_modulo(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = divide_by_quotient_comparison(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
}
}
Outputs:
g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
round_divide_by_float_casting(): 54.7 ns
round_divide_by_modulo(): 24 ns
divide_by_quotient_comparison(): 25.5 ns
Demo
The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).

Which is better really depends on the processor and on the range of the integers (and using double would resolve most of the range issues).
For modern "big" CPUs like x86-64 and ARM, integer division and floating point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and does the correct rounding directly in that conversion, at least), so most likely the resulting operations are:
atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)
About four machine instructions.
On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.
tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;
So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b for a % b - but it's still two distinct divides).
Of course, if you are outside the range of digits that a float or double can hold without losing precision (24 bits for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).
On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.
Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++ (v3.9-pre-release, pretty close to released 3.8).
round_divide_by_float_casting(): 32.5 ns
round_divide_by_modulo(): 113 ns
divide_by_quotient_comparison(): 80.4 ns
However, the interesting thing I find when I look at the generated code:
xorps %xmm0, %xmm0
cvtsi2ssl 8016(%rsp,%rbp), %xmm0
xorps %xmm1, %xmm1
cvtsi2ssl 4016(%rsp,%rbp), %xmm1
divss %xmm1, %xmm0
callq roundf
cvttss2si %xmm0, %eax
movl %eax, 16(%rsp,%rbp)
addq $4, %rbp
cmpq $4000, %rbp # imm = 0xFA0
jne .LBB0_7
is that the round is actually a call. Which really surprises me, but explains why on some machines (particularly more recent x86 processors), it is faster.
g++ gives better results with -ffast-math, which gives around:
round_divide_by_float_casting(): 17.6 ns
round_divide_by_modulo(): 43.1 ns
divide_by_quotient_comparison(): 18.5 ns
(This is with increased count to 100k values)
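The precision limit mentioned in this answer can be made concrete with a small sketch. It mirrors the two functions from the question and assumes int is 32 bits and float is IEEE-754 single precision (24-bit significand). The intermediate float variables are there so the dividend is really rounded to float before the division.

```cpp
#include <cmath>

// Float-casting variant from the question, with explicit float temporaries.
int round_divide_by_float_casting(int a, int b) {
    const float fa = static_cast<float>(a);  // rounds if |a| needs more than 24 bits
    const float fb = static_cast<float>(b);
    return static_cast<int>(std::round(fa / fb));
}

// Integer-only variant from the question.
int round_divide_by_modulo(int a, int b) {
    return a / b + a % b * 2 / b;
}
```

With a = 16777217 (2^24 + 1) and b = 1, the integer version returns 16777217, while the float version returns 16777216, because 2^24 + 1 is not representable as a float. The integer version has its own limit: a % b * 2 overflows int once the remainder exceeds INT_MAX / 2.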

Prefer the standard solution. Use std::div family of functions declared in cstdlib.
See: http://en.cppreference.com/w/cpp/numeric/math/div
Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.

Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) { // round up when the remainder is at least half the divisor
++result;
}
return result;
}
int main()
{
int itr = 1000;
//while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
divide_by_quotient_comparison(i, j);
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_float_casting(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_modulo(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
//}
return 0;
}
The results I got on my machine (i7 with Visual Studio 2015) were as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by @YSC and @teroi) was faster than the int → float → int method, but slower than the modulo arithmetic method.
A second test was performed to avoid certain compiler optimizations pointed out by @YSC:
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) { // round up when the remainder is at least half the divisor
++result;
}
return result;
}
int main()
{
int itr = 100;
vector <int> randi, randj;
for (int i = 0; i < itr; i++) {
randi.push_back(rand());
int rj = rand();
if (rj == 0)
rj++;
randj.push_back(rj);
}
vector<int> f, m, q;
while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
cout << endl;
f.clear();
m.clear();
q.clear();
}
return 0;
}
In this second test the slowest was divide_by_quotient_comparison(), which relies on std::div_t, followed by round_divide_by_float_casting(), and the fastest again was round_divide_by_modulo(). However, this time the performance difference was much smaller, less than 20%.

_mm_load_ps causes a segmentation fault

I have a code snippet. The snippet just loads 2 arrays and calculates dot product between them using SSE.
Code here:
using namespace std;
long long size = 3200000;
float* _random()
{
unsigned int seed = 123;
// float *t = malloc(size*sizeof(float));
float *t = new float[size];
int i;
float num = 0.0;
for(i=0; i < size; i++) {
num = rand()/(RAND_MAX+1.0);
t[i] = num;
}
return t;
}
float _dotProductVectorSSE(float *s1, float *s2)
{
float prod;
int i;
__m128 X, Y, Z;
for(i=0; i<size; i+=4)
{
X = _mm_load_ps(&s1[i]);
Y = _mm_load_ps(&s2[i]);
X = _mm_mul_ps(X, Y);
Z = _mm_add_ps(X, Z);
}
float *v = new float[4];
_mm_store_ps(v,Z);
for(i=0; i<4; i++)
{
// prod += Z[i];
std::cout << v[i] << endl;
}
return prod;
}
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
time_t start, stop;
double avg_time = 0;
double cur_time;
float* s1 = NULL;
float* s2 = NULL;
for(int i = 0; i < 100; i++)
{
s1 = _random();
s2 = _random();
start = clock();
float sse_product = _dotProductVectorSSE(s1, s2);
stop = clock();
cur_time = ((double) stop-start) / CLOCKS_PER_SEC;
avg_time += cur_time;
}
std::cout << "Averagely used " << avg_time/100 << " seconds." << endl;
return a.exec();
}
When I run it, I get a segmentation fault. Here is the backtrace:
(gdb) bt
0 0x0804965f in _mm_load_ps (__P=0xb6b56008) at /usr/lib/gcc/i586-suse-linux/4.6/include/xmmintrin.h:899
1 _dotProductVectorSSE (s1=0xb6b56008, s2=0xb5f20008) at ../simd/simd.cpp:37
2 0x0804987f in main (argc=1, argv=0xbfffee84) at ../simd/simd.cpp:80
Disassembly:
0x8049b30 push %ebp
0x8049b31 <+0x0001> push %edi
0x8049b32 <+0x0002> push %esi
0x8049b33 <+0x0003> push %ebx
0x8049b34 <+0x0004> sub $0x2c,%esp
0x8049b37 <+0x0007> mov 0x804c0a4,%esi
0x8049b3d <+0x000d> mov 0x40(%esp),%edx
0x8049b41 <+0x0011> mov 0x44(%esp),%ecx
0x8049b45 <+0x0015> mov 0x804c0a0,%ebx
0x8049b4b <+0x001b> cmp $0x0,%esi
0x8049b4e <+0x001e> jl 0x8049b7a <_Z20_dotProductVectorSSEPfS_+74>
0x8049b50 <+0x0020> jle 0x8049c10 <_Z20_dotProductVectorSSEPfS_+224>
0x8049b56 <+0x0026> add $0xffffffff,%ebx
0x8049b59 <+0x0029> adc $0xffffffff,%esi
0x8049b5c <+0x002c> xor %eax,%eax
0x8049b5e <+0x002e> shrd $0x2,%esi,%ebx
0x8049b62 <+0x0032> add $0x1,%ebx
0x8049b65 <+0x0035> shl $0x2,%ebx
**0x8049b68 <+0x0038> movaps (%edx,%eax,4),%xmm0**
0x8049b6c <+0x003c> mulps (%ecx,%eax,4),%xmm0
0x8049b70 <+0x0040> add $0x4,%eax
0x8049b73 <+0x0043> cmp %ebx,%eax
0x8049b75 <+0x0045> addps %xmm0,%xmm1
0x8049b78 <+0x0048> jne 0x8049b68 <_Z20_dotProductVectorSSEPfS_+56>
0x8049b7a <+0x004a> movaps %xmm1,0x10(%esp)
0x8049b7f <+0x004f> xor %ebx,%ebx
I am using QtCreator and defined in .pro file:
QMAKE_CXXFLAGS += -msse -msse2
DEFINES += __SSE__
DEFINES += __SSE2__
DEFINES += __MMX__
Please tell me how to fix that problem !

You are not ensuring that your data is 16-byte aligned (malloc/new are not sufficient in general). You will either need to use _mm_loadu_ps instead of _mm_load_ps to deal with your potentially misaligned data or, preferably, use a suitable method to allocate aligned memory (e.g. posix_memalign on Linux).
Note that you should use _mm_load_ps and 16-byte aligned memory if you possibly can; otherwise use _mm_loadu_ps, but note that this may reduce performance significantly on some (older) CPUs.
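Both options from this answer can be sketched as follows. This is an illustration under assumptions rather than a drop-in fix for the question's program: the names dot_product_sse and alloc_aligned_floats are hypothetical, the target is assumed to be x86 with SSE enabled, and _mm_malloc/_mm_free are assumed to be reachable through xmmintrin.h as they are with GCC and Clang.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Dot product that tolerates any alignment by using unaligned loads.
float dot_product_sse(const float* s1, const float* s2, std::size_t n) {
    __m128 acc = _mm_setzero_ps();  // the accumulator must start at zero
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 x = _mm_loadu_ps(s1 + i);
        const __m128 y = _mm_loadu_ps(s2 + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(x, y));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float prod = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) {  // scalar tail when n is not a multiple of 4
        prod += s1[i] * s2[i];
    }
    return prod;
}

// Allocates n floats on a 16-byte boundary; release with _mm_free.
float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}
```

With buffers that come from alloc_aligned_floats, the two _mm_loadu_ps calls could be swapped for _mm_load_ps. Memory obtained this way has to be released with _mm_free, not delete[] or free.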

Try the link below.
http://flyeater.wordpress.com/2010/11/29/memory-allocation-and-data-alignment-custom-mallocfree/
You basically allocate a bit more memory than you need, then calculate the first address inside that block which is a multiple of 16 and use the memory beginning from that address to load/store data.
Take care of pointer arithmetic.
Most of the code here ideone.com/fXKQhR is taken from the above link, sample usage.

I think _mm_malloc may be helpful for you.

Can counting byte matches between two strings be optimized using SIMD?

Profiling suggests that this function is a real bottleneck for my application:
static inline int countEqualChars(const char* string1, const char* string2, int size) {
int r = 0;
for (int j = 0; j < size; ++j) {
if (string1[j] == string2[j]) {
++r;
}
}
return r;
}
Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be faster. Any ideas on how to speed things up? Target architecture is x86-64.

Of course it can.
pcmpeqb compares two vectors of 16 bytes and produces a vector with zeros where they differed, and -1 where they match. Use this to compare 16 bytes at a time, adding the result to an accumulator vector (make sure to accumulate the results of at most 255 vector compares to avoid overflow). When you're done, there are 16 results in the accumulator. Sum them and negate to get the number of equal elements.
If the lengths are very short, it will be hard to get a significant speedup from this approach. If the lengths are long, then it will be worth pursuing.

Compiler flags for vectorization:
-ftree-vectorize
-ftree-vectorize -march=<your_architecture> (use all instruction-set extensions available on your computer, not just the baseline like SSE2 for x86-64). Use -march=native to optimize for the machine the compiler is running on. -march=<foo> also sets -mtune=<foo>, which is also a good thing.
Using SSEx intrinsics:
Pad and align the buffer to 16 bytes (according to the vector size you're actually going to use)
Create an accumulator countU8 with _mm_set1_epi8(0)
For all n/16 input (sub) vectors, do:
Load 16 chars from both strings with _mm_load_si128 or _mm_loadu_si128 (for unaligned loads)
Compare the octets in parallel with _mm_cmpeq_epi8. Each match yields 0xFF (-1), 0x00 otherwise.
Subtract the above result vector from countU8 using _mm_sub_epi8 (minus -1 -> +1)
After at most 255 iterations, the 16 8-bit counters must be extracted into a larger integer type to prevent overflows. See unpack and horizontal add in this nice answer for how to do that: https://stackoverflow.com/a/10930706/1175253
Code:
#include <iostream>
#include <vector>
#include <cassert>
#include <cstdint>
#include <climits>
#include <cstring>
#include <emmintrin.h>
#ifdef __SSE2__
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if UINTPTR_MAX == UINT64_MAX
#define PTR_64
#elif UINTPTR_MAX == UINT32_MAX
#define PTR_32
#else
# error "Current UINTPTR_MAX is not supported"
#endif
template<typename T>
void print_vector(std::ostream& out,const __m128i& vec)
{
static_assert(sizeof(vec) % sizeof(T) == 0,"Invalid element size");
std::cout << '{';
const T* const end = reinterpret_cast<const T*>(&vec)-1;
const T* const upper = end+(sizeof(vec)/sizeof(T));
for(const T* elem = upper;
elem != end;
--elem
)
{
if(elem != upper)
std::cout << ',';
std::cout << +(*elem);
}
std::cout << '}' << std::endl;
}
#define PRINT_VECTOR(_TYPE,_VEC) do{ std::cout << #_VEC << " : "; print_vector<_TYPE>(std::cout,_VEC); } while(0)
///@note SSE2 required (macro: __SSE2__)
///@warning Not tested!
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
assert(a_in != nullptr && (uintptr_t(a_in) % 16) == 0);
assert(b_in != nullptr && (uintptr_t(b_in) % 16) == 0);
//assert(count > 0);
/*
//maybe not so good with all that branching and additional loop variables
__m128i accumulatorU8 = _mm_set1_epi8(0);
__m128i sum2xU64 = _mm_set1_epi8(0);
for(size_t i = 0;i < count;++i)
{
//this operation could also be unrolled, where multiple result registers would be accumulated
accumulatorU8 = _mm_sub_epi8(accumulatorU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
if(i % 255 == 0)
{
//before overflow of uint8, the counter will be extracted
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//reset accumulatorU8
accumulatorU8 = _mm_set1_epi8(0);
}
}
//blindly accumulate remaining values
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
*/
__m128i sum2xU64 = _mm_set1_epi32(0);
while(count--)
{
__m128i matches = _mm_sub_epi8(_mm_set1_epi32(0),_mm_cmpeq_epi8(*a_in++,*b_in++));
__m128i sum2xU16 = _mm_sad_epu8(matches,_mm_set1_epi32(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
#ifndef NDEBUG
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
}
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#ifndef NDEBUG
std::cout << "----------------------------------------" << std::endl;
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
}
#endif
int main(int argc, char* argv[])
{
std::vector<__m128i> a(64); // * 16 bytes
std::vector<__m128i> b(a.size());
const size_t nBytes = a.size() * sizeof(std::vector<__m128i>::value_type);
char* const a_out = reinterpret_cast<char*>(a.data());
char* const b_out = reinterpret_cast<char*>(b.data());
memset(a_out,0,nBytes);
memset(b_out,0,nBytes);
a_out[1023] = 1;
b_out[1023] = 1;
size_t equalBytes = counteq_epi8(a.data(),b.data(),a.size());
std::cout << "equalBytes = " << equalBytes << std::endl;
return 0;
}
The fastest SSE implementation I got for large and small arrays:
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
    assert((count > 0 ? a_in != nullptr : true) && (uintptr_t(a_in) % sizeof(__m128i)) == 0);
    assert((count > 0 ? b_in != nullptr : true) && (uintptr_t(b_in) % sizeof(__m128i)) == 0);
    //assert(count > 0);
    //each 8-bit counter can absorb at most 255 matches before it overflows
    const size_t maxInnerLoops = 255;
    const size_t nNestedLoops = count / maxInnerLoops;
    const size_t nRemainderLoops = count % maxInnerLoops;
    const __m128i zero = _mm_setzero_si128();
    __m128i sum16xU8 = zero;
    __m128i sum2xU64 = zero;
    for(size_t i = 0;i < nNestedLoops;++i)
    {
        for(size_t j = 0;j < maxInnerLoops;++j)
        {
            sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
        }
        //widen the 16 byte counters into two 64-bit lanes, then reset them
        sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
        sum16xU8 = zero;
    }
    for(size_t j = 0;j < nRemainderLoops;++j)
    {
        sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
    }
    sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
    //horizontal addition of the two 64-bit lanes
    sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if UINTPTR_MAX == UINT64_MAX
    return _mm_cvtsi128_si64(sum2xU64);
#elif UINTPTR_MAX == UINT32_MAX
    return _mm_cvtsi128_si32(sum2xU64);
#else
# error "Current UINTPTR_MAX is not supported"
#endif
}
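If you want to sanity-check the blocked accumulation without touching intrinsics, a portable model may help. The following is only a sketch: counteq_scalar and counteq_blocked are hypothetical names that do not appear in the code above, and sixteen plain uint8_t counters stand in for one __m128i register. It mirrors the idea of flushing the narrow counters after at most 255 blocks, and it is not meant to be fast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference: number of positions i < n with a[i] == b[i].
std::size_t counteq_scalar(const std::uint8_t* a, const std::uint8_t* b, std::size_t n)
{
    std::size_t matches = 0;
    for (std::size_t i = 0; i < n; ++i)
        matches += (a[i] == b[i]) ? 1u : 0u;
    return matches;
}

// Portable model of the blocked scheme (hypothetical helper): sixteen 8-bit
// lane counters are flushed into a wide total after at most 255 blocks of
// 16 bytes, so no lane can wrap around.
std::size_t counteq_blocked(const std::uint8_t* a, const std::uint8_t* b, std::size_t nBlocks)
{
    assert(nBlocks == 0 || (a != nullptr && b != nullptr));
    constexpr std::size_t kLanes = 16;
    constexpr std::size_t kMaxInnerLoops = 255;
    std::size_t total = 0;
    while (nBlocks > 0) {
        const std::size_t chunk = nBlocks < kMaxInnerLoops ? nBlocks : kMaxInnerLoops;
        std::uint8_t lanes[kLanes] = {};
        for (std::size_t blk = 0; blk < chunk; ++blk) {
            for (std::size_t l = 0; l < kLanes; ++l)
                lanes[l] = static_cast<std::uint8_t>(lanes[l] + (a[l] == b[l] ? 1 : 0));
            a += kLanes;
            b += kLanes;
        }
        // Widening step: the role _mm_sad_epu8 plays in the SSE version.
        for (std::size_t l = 0; l < kLanes; ++l)
            total += lanes[l];
        nBlocks -= chunk;
    }
    return total;
}
```

Comparing both functions on the same buffers (with the byte count equal to 16 times the block count) should give identical results; the same comparison can be made against the SSE function on an x86 machine.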

Auto-vectorization in current gcc is a matter of helping the compiler understand that the code is easy to vectorize. In your case, it will vectorize the loop if you remove the conditional and rewrite the code in a more imperative way:
static inline int count(const char* string1, const char* string2, int size) {
    int r = 0;
    bool b;
    for (int j = 0; j < size; ++j) {
        b = (string1[j] == string2[j]);
        r += b;
    }
    return r;
}
In this case, the generated assembly uses SSE instructions (note the pcmpeqb):
movdqa 16(%rsp), %xmm1
movl $.LC2, %esi
pxor %xmm2, %xmm2
movzbl 416(%rsp), %edx
movdqa .LC1(%rip), %xmm3
pcmpeqb 224(%rsp), %xmm1
cmpb %dl, 208(%rsp)
movzbl 417(%rsp), %eax
movl $1, %edi
pand %xmm3, %xmm1
movdqa %xmm1, %xmm5
sete %dl
movdqa %xmm1, %xmm4
movzbl %dl, %edx
punpcklbw %xmm2, %xmm5
punpckhbw %xmm2, %xmm4
pxor %xmm1, %xmm1
movdqa %xmm5, %xmm6
movdqa %xmm5, %xmm0
movdqa %xmm4, %xmm5
punpcklwd %xmm1, %xmm6
(etc.)
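Whether a particular gcc version vectorizes one form or the other is something to check rather than assume. Below is a small sketch for doing that. The branchy variant is my guess at what a loop with the conditional looks like, and both function names are made up for illustration. Compiling with -O3 -fopt-info-vec makes gcc report which loops it vectorized.

```cpp
#include <cassert>

// Branchy form (assumed shape of the loop with the conditional).
int count_branchy(const char* s1, const char* s2, int size)
{
    assert(size >= 0);
    int r = 0;
    for (int j = 0; j < size; ++j) {
        if (s1[j] == s2[j])
            ++r;
    }
    return r;
}

// Branchless form: the comparison result (0 or 1) is added directly.
int count_branchless(const char* s1, const char* s2, int size)
{
    assert(size >= 0);
    int r = 0;
    for (int j = 0; j < size; ++j)
        r += (s1[j] == s2[j]);
    return r;
}
```

Both functions must return the same count for the same input, so the only thing left to compare is the compiler's vectorization report and the timing.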

Timed vector vs map vs unordered_map lookup

I was curious about vector lookup vs. map lookup and wrote a little test program for it. It seems like vector is always faster the way I'm using it. Is there something else I should take into consideration here? Is the test biased in any way? The results of a run are at the bottom. They are in nanoseconds, but gcc doesn't seem to support that resolution on my platform.
Using string for the lookup would of course change things a lot.
The compile line I'm using is this: g++ -O3 --std=c++0x -o lookup lookup.cpp
#include <iostream>
#include <vector>
#include <map>
#include <unordered_map>
#include <chrono>
#include <algorithm>

unsigned dummy = 0;

class A
{
public:
    A(unsigned id) : m_id(id){}
    unsigned id(){ return m_id; }
    void func()
    {
        //making sure its not optimized away
        dummy++;
    }
private:
    unsigned m_id;
};

class B
{
public:
    void func()
    {
        //making sure its not optimized away
        dummy++;
    }
};

int main()
{
    std::vector<A> v;
    std::unordered_map<unsigned, B> u;
    std::map<unsigned, B> m;
    unsigned elementCount = 1;
    struct Times
    {
        unsigned long long v;
        unsigned long long u;
        unsigned long long m;
    };
    std::map<unsigned, Times> timesMap;
    while(elementCount != 10000000)
    {
        elementCount *= 10;
        for(unsigned i = 0; i < elementCount; ++i)
        {
            v.emplace_back(A(i));
            u.insert(std::make_pair(i, B()));
            m.insert(std::make_pair(i, B()));
        }
        // auto avoids a type mismatch where high_resolution_clock is not steady_clock
        auto start = std::chrono::high_resolution_clock::now();
        for(unsigned i = 0; i < elementCount; ++i)
        {
            auto findItr = std::find_if(std::begin(v), std::end(v),
                [&i](A & a){ return a.id() == i; });
            findItr->func();
        }
        auto tp0 = std::chrono::high_resolution_clock::now()- start;
        unsigned long long vTime = std::chrono::duration_cast<std::chrono::nanoseconds>(tp0).count();
        start = std::chrono::high_resolution_clock::now();
        for(unsigned i = 0; i < elementCount; ++i)
        {
            u[i].func();
        }
        auto tp1 = std::chrono::high_resolution_clock::now()- start;
        unsigned long long uTime = std::chrono::duration_cast<std::chrono::nanoseconds>(tp1).count();
        start = std::chrono::high_resolution_clock::now();
        for(unsigned i = 0; i < elementCount; ++i)
        {
            m[i].func();
        }
        auto tp2 = std::chrono::high_resolution_clock::now()- start;
        unsigned long long mTime = std::chrono::duration_cast<std::chrono::nanoseconds>(tp2).count();
        timesMap.insert(std::make_pair(elementCount ,Times{vTime, uTime, mTime}));
    }
    for(auto & itr : timesMap)
    {
        std::cout << "Element count: " << itr.first << std::endl;
        std::cout << "std::vector time: " << itr.second.v << std::endl;
        std::cout << "std::unordered_map time: " << itr.second.u << std::endl;
        std::cout << "std::map time: " << itr.second.m << std::endl;
        std::cout << "-----------------------------------" << std::endl;
    }
    std::cout << dummy;
}
./lookup
Element count: 10
std::vector time: 0
std::unordered_map time: 0
std::map time: 1000
-----------------------------------
Element count: 100
std::vector time: 0
std::unordered_map time: 3000
std::map time: 13000
-----------------------------------
Element count: 1000
std::vector time: 2000
std::unordered_map time: 29000
std::map time: 138000
-----------------------------------
Element count: 10000
std::vector time: 22000
std::unordered_map time: 287000
std::map time: 1610000
-----------------------------------
Element count: 100000
std::vector time: 72000
std::unordered_map time: 1539000
std::map time: 8994000
-----------------------------------
Element count: 1000000
std::vector time: 746000
std::unordered_map time: 12654000
std::map time: 154060000
-----------------------------------
Element count: 10000000
std::vector time: 8001000
std::unordered_map time: 123608000
std::map time: 2279362000
-----------------------------------
33333330

I'm not at all shocked the vector tested better than anything else. The asm code for it (actual disassembly) breaks down to this (on my Apple LLVM 4.2 at full opt):
0x100001205: callq 0x100002696 ; symbol stub for: std::__1::chrono::steady_clock::now()
0x10000120a: testl %r13d, %r13d
0x10000120d: leaq -272(%rbp), %rbx
0x100001214: je 0x100001224 ; main + 328 at main.cpp:78
0x100001216: imull $10, %r14d, %ecx
0x10000121a: incl 7896(%rip) ; dummy
0x100001220: decl %ecx
0x100001222: jne 0x10000121a ; main + 318 [inlined] A::func() at main.cpp:83
main + 318 at main.cpp:83
0x100001224: movq %rax, -280(%rbp)
0x10000122b: callq 0x100002696 ; symbol stub for: std::__1::chrono::
Note the 'loop' (the jne 0x10000121a). The "find_if" has been completely optimized out, and the result is effectively a sweep over the array with a decrementing register to count how many times to increment the global. That's all that is being done; no searching of any kind takes place.
So yeah, it's how you're using it.

First, you don't seem to clear your containers between tests. So they don't contain what you think they do.
Second, according to your times, your vector exhibits linear time, which is something that just can't be, as complexity is O(N*N) in your algorithm. Probably it WAS optimized away. Instead of trying to combat optimization, I would suggest just turning it off.
Third, your values are too predictable for a vector. This can impact it dramatically. Try random values (or a random_shuffle())
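To make the three points concrete, here is a rough sketch of one fairer benchmark round. It is not the asker's code: RoundResult, time_lookups and run_round are names I made up, and instead of turning optimization off it takes a different route, namely feeding every lookup result into a checksum so the searches cannot be dropped. Containers are rebuilt for each round, and both the stored order and the lookup order are shuffled.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <numeric>
#include <random>
#include <unordered_map>
#include <vector>

struct RoundResult {
    long long vector_ns;
    long long unordered_ns;
    long long map_ns;
    std::uint64_t checksum;  // sum of all values found; print it so lookups stay observable
};

// Hypothetical helper: times lookup(key) over all keys, adding each result to checksum.
template <typename Lookup>
long long time_lookups(const std::vector<unsigned>& keys, std::uint64_t& checksum, Lookup lookup)
{
    const auto start = std::chrono::steady_clock::now();
    for (unsigned key : keys)
        checksum += lookup(key);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

RoundResult run_round(unsigned n, std::mt19937& rng)
{
    // Fresh containers for every round, so each holds exactly n elements.
    std::vector<unsigned> v(n);
    std::iota(v.begin(), v.end(), 0u);
    std::shuffle(v.begin(), v.end(), rng);  // unpredictable storage order

    std::unordered_map<unsigned, unsigned> u;
    std::map<unsigned, unsigned> m;
    for (unsigned id : v) {
        u.emplace(id, id);
        m.emplace(id, id);
    }

    std::vector<unsigned> keys(v);
    std::shuffle(keys.begin(), keys.end(), rng);  // unpredictable lookup order

    RoundResult r{0, 0, 0, 0};
    r.vector_ns = time_lookups(keys, r.checksum, [&](unsigned k) -> unsigned {
        auto it = std::find(v.begin(), v.end(), k);  // linear search, O(n) per lookup
        assert(it != v.end());
        return *it;
    });
    r.unordered_ns = time_lookups(keys, r.checksum, [&](unsigned k) -> unsigned {
        return u.find(k)->second;
    });
    r.map_ns = time_lookups(keys, r.checksum, [&](unsigned k) -> unsigned {
        return m.find(k)->second;
    });
    return r;
}
```

Calling run_round once per element count with a seeded std::mt19937 and printing the checksum at the end should show the quadratic cost of the vector search as n grows. Keep n moderate, since the linear search makes a round cost on the order of n*n comparisons.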

Fastest way to lookup bytes from 64-byte arrays - c++

Your code is already really fast. However, on my system the execution is about 4.858 % faster when you change
const int n_indices = 1000000;
to
const int n_indices = 1048576; // 2^20
This is not much, but it's something.
