In my application, 20% of the CPU time is spent reading bits (skip) in my bit reader. Does anyone have an idea how one might make the following code faster? At any given time I do not need more than 20 valid bits (which is why, in some situations, I can use fast_skip).
Bits are read in big-endian order, which is why the byte swap is needed.
class bit_reader
{
    std::uint32_t* m_data;
    std::size_t m_pos;
    std::uint64_t m_block;
public:
    bit_reader(void* data)
        : m_data(reinterpret_cast<std::uint32_t*>(data))
        , m_pos(0)
        , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
    {
    }
    std::uint64_t value(std::size_t n_bits = 64)
    {
        assert(m_pos + n_bits < 64);
        return (m_block << m_pos) >> (64 - n_bits);
    }
    void skip(std::size_t n_bits) // 20% cpu time
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
        if(m_pos > 31)
        {
            m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
            m_pos -= 32;
        }
    }
    void fast_skip(std::size_t n_bits)
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
    }
};
Target hardware is x64.
I see from an earlier comment you are unpacking Huffman/arithmetic coded streams in JPEG.
skip() and value() are really simple enough to be inlined. There's a chance that the compiler will keep the shift register and buffer pointer in registers the whole time. Marking all pointers here and in the caller with the restrict modifier might help by telling the compiler that you won't be writing the results of Huffman decoding into the bit buffer, thus allowing further optimisation.
The average length of each Huffman/arithmetic symbol is short - so, ~7 times out of 8, you won't need to top up the 64-bit shift register. Investigate giving the compiler a branch-prediction hint.
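For example, with C++20 you could annotate the refill branch along these lines (a sketch of the idea, not something I have measured; GCC/Clang also offer __builtin_expect for the same purpose):
void skip(std::size_t n_bits)
{
    assert(m_pos + n_bits < 42);
    m_pos += n_bits;
    if (m_pos > 31) [[unlikely]]   // taken roughly 1 time in 8 for typical symbol lengths
    {
        m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
        m_pos -= 32;
    }
}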
It's unusual for any symbol in the JPEG bitstream to be longer than 32 bits. Does this allow further optimization?
One very logical reason that skip() is a heavy path is that you're calling it a lot. You are consuming an entire symbol at once rather than bit by bit here, aren't you? There are some clever tricks you can do by counting leading 0s or 1s in symbols and doing a table lookup.
You might consider arranging your shift register such that the next bit in the stream is the LSB. This will avoid the shifts in value().
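A minimal sketch of that arrangement, assuming the decode tables are rebuilt for the reversed bit order and that no more than 20 bits are ever peeked (m_filled is a hypothetical member tracking how many valid bits remain in the register):
std::uint64_t value(std::size_t n_bits)
{
    return m_block & ((1ull << n_bits) - 1);   // a single mask, no shifts (valid for n_bits < 64)
}
void skip(std::size_t n_bits)
{
    m_block >>= n_bits;    // consumed bits fall off the bottom
    m_filled -= n_bits;    // refill from the source when m_filled drops below the peek width
}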
Shifting by up to 64 bits is definitely not a good idea. In many CPUs shift is a slow operation.
I would advise you to change your code to byte addressing. This will limit the shift to 8 bits at most.
In many cases you really do not need the bit by itself, but rather to check whether it is present or not. This can be done with code like:
if (data[bit_inx/64] & mask[bit_inx % 64])
{
....
}
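One way to build the mask table this snippet assumes (hypothetical; whether the masks should be MSB-first or LSB-first within each word depends on how bit_inx is numbered relative to the byte-swapped data):
std::uint64_t mask[64];
for (int i = 0; i < 64; ++i)
    mask[i] = 0x8000000000000000ULL >> i;   // MSB-first numbering; use 1ull << i for LSB-first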
Try substituting this line in skip:
m_block = (m_block << 32) | _byteswap_ulong(*++m_data);
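For reference, here is skip() with that line substituted (a sketch; behaviour is otherwise unchanged, only the refill is narrowed to 32 bits):
void skip(std::size_t n_bits)
{
    assert(m_pos + n_bits < 42);
    m_pos += n_bits;
    if (m_pos > 31)
    {
        // Keep the upper 32 bits already in the register and append the next 32 bits,
        // instead of re-reading and byte-swapping a full 64-bit word.
        m_block = (m_block << 32) | _byteswap_ulong(*++m_data);
        m_pos -= 32;
    }
}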
I don't know if it's the cause, or what the underlying implementation of _byteswap_uint64 looks like, but you should read Rob Pike's article on byte order. Maybe that's your answer.
Abstract: endianness is less of a problem than it's often made out to be. And implementations of byte-order swapping often come with issues. But there's a simple alternative.
[EDIT] I've got a better theory. Pasted from my comment below:
Maybe it's alignment. 64-bit architectures like data aligned on 64-bit boundaries; when you read across an alignment boundary, it gets pretty slow. So it could be the (++m_data)[0] part: x64 wants 64-bit alignment, and when you reinterpret_cast a uint32_t* to uint64_t*, you are crossing alignment boundaries about half of the time.
If your source buffers are not huge, then you should pre-process them, byte-swap the buffers before you access them using the bit_reader!
Reading from your bit_reader will be much faster then, because:
you will save some conditional instructions
the CPU caches can be used more efficiently: the reader can fetch straight from memory that is most probably already in the CPU cache, instead of reading from memory that is modified after each 64-bit chunk is read, which destroys the benefit of having had it in cache
EDIT
Oh wait, you do not modify the source buffer. However, putting the byteswap into a pre-processing stage should at least be worth a try.
Another point: make sure those assert() calls end up only in the debug build.
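That is standard <cassert> behaviour, for example:
// Release builds typically define NDEBUG (e.g. cl /O2 /DNDEBUG ... or g++ -O2 -DNDEBUG ...).
// With NDEBUG defined, assert(m_pos + n_bits < 42) expands to nothing and generates no code.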
EDIT 2
(deleted)
EDIT 3
Your code is definitely flawed, check the following usage scenario:
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x77665544
br.skip(16);
br.value(16); // -> 0x33221100
br.skip(16); // -> triggers reading more bits
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA9988
br.skip(16);
br.value(16); // -> 0x77665544
// that's not what you expect, right ???
EDIT 4
Well, no, EDIT 3 was wrong. But I can't help it, the code is flawed, isn't it?
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x7766
br.skip(16);
br.value(16); // -> 0x5544
br.skip(16); // -> triggers reading more bits (because m_pos=32, which is: m_pos>31)
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA --> not what you expect, right?
Here is another version I tried, which didn't give any performance improvements.
class bit_reader
{
public:
    const std::uint64_t* m_data64;
    std::size_t m_pos64;
    std::uint64_t m_block0;
    std::uint64_t m_block1;
    bit_reader(const void* data)
        : m_pos64(0)
        , m_data64(reinterpret_cast<const std::uint64_t*>(data))
        , m_block0(byte_swap(*m_data64++))
        , m_block1(byte_swap(*m_data64++))
    {
    }
    std::uint64_t value(std::size_t n_bits = 64)
    {
        return __shiftleft128(m_block1, m_block0, m_pos64) >> (64 - n_bits);
    }
    void skip(std::size_t n_bits)
    {
        m_pos64 += n_bits;
        if(m_pos64 > 63)
        {
            m_block0 = m_block1;
            m_block1 = byte_swap(*m_data64++);
            m_pos64 -= 64;
        }
    }
    void fast_skip(std::size_t n_bits)
    {
        skip(n_bits);
    }
};
If possible it would be best to do this in multiple passes. Multiple runs can be optimized and branching reduced.
In general it is best to do something like:
uint64_t* arr = data;   // not const: the buffer is swapped in place
for(uint64_t* i = arr; i != &arr[len / sizeof(uint64_t)]; i++)
{
    *i = _byteswap_uint64(*i);
    // no more operations here
}
// another similar for loop
Such code can reduce the run time by a huge factor.
At worst you can do it in runs of ~100k blocks, to keep cache misses to a minimum and load the data from RAM only once.
In your case you do it in a streaming way, which is good only for keeping memory usage low and getting faster responses from a slow data source, but not for speed.
Related
I want to write to a file a series of binary strings whose lengths are expressed in bits rather than bytes. Take into consideration two strings s1 and s2 that in binary are respectively 011 and 01011. In this case the content of the output file has to be 01101011 (1 byte). I am trying to do this in the most efficient way possible, since I have several million strings to concatenate for a total of several GB of output.
C++ has no way of working directly with bits because it aims at being a light layer above the hardware, and the hardware itself is not bit oriented. The very minimum amount of bits you can read/write in one operation is a byte (normally 8 bits).
Also if you need to do disk i/o it's better to write your data in blocks instead of one byte at a time. The library has some buffering, but the earlier things are buffered the faster the code will be (less code is involved in passing data around).
A simple approach could be
unsigned char iobuffer[4096];
int bufsz;                          // how many bytes are present in the buffer
unsigned long long bit_accumulator;
int acc_bits;                       // how many bits are present in the accumulator

void writeCode(unsigned long long code, int bits) {
    bit_accumulator |= code << acc_bits;
    acc_bits += bits;
    while (acc_bits >= 8) {
        iobuffer[bufsz++] = bit_accumulator & 255;
        bit_accumulator >>= 8;
        acc_bits -= 8;
        if (bufsz == sizeof(iobuffer)) {
            // Write the buffer to disk
            bufsz = 0;
        }
    }
}
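The sample never emits the final partial byte or the remaining buffer contents; a sketch of that missing step could look like this (hypothetical helper using the same globals as above, padding the last byte with zero bits):
void flushRemaining(FILE* f) {
    if (acc_bits > 0) {                        // pad and emit the last partial byte
        iobuffer[bufsz++] = bit_accumulator & 255;
        bit_accumulator = 0;
        acc_bits = 0;
    }
    if (bufsz > 0) {                           // write out whatever is left in the buffer
        fwrite(iobuffer, 1, bufsz, f);
        bufsz = 0;
    }
}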
There is no optimal way to solve your problem per se, but you can use a few tricks to speed things up:
Experiment with the file I/O sync flag. It might be that one of set/unset is significantly faster than the other, because of buffering and caching.
Try to use architecture-sized variables so that they fit into registers directly: uint32_t for 32-bit machines and uint64_t for 64-bit machines ...
volatile might help too, to keep things in registers.
Use pointers and references for large data and copy small data blobs (to avoid unnecessary copies of large data, and extra lookups and page touching for small data).
Use mmap of the file for direct access and align your output to the page size of your architecture and hard disk (usually 4 KiB = 4096 bytes); see the sketch after this list.
Try to reduce branching (instructions like "if", "for", "while", "() ? :") and linearize your code.
And if that is not enough and when the going gets rough: use assembler (but I would not recommend that for beginners).
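For the mmap point, a minimal POSIX sketch (my own assumption, not from the original answer: the total output size is known up front so the file can be sized with ftruncate; error handling omitted):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

unsigned char* map_output(const char* path, std::size_t total_bytes)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, total_bytes);                                   // size the file up front
    void* p = mmap(nullptr, total_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                                    // the mapping keeps the file alive
    return static_cast<unsigned char*>(p);
    // pack bits into the returned buffer as into a plain array, then munmap(p, total_bytes)
}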
I think multithreading would be counterproductive in this case, because of the limited number of file writes that can be issued, and because the problem is not easily divisible into small tasks: each task needs to know how many bits after the other ones it has to start, and you would have to join all the results together in the end.
I've used the following in the past, it might help a bit...
FileWriter.h:
#ifndef FILE_WRITER_H
#define FILE_WRITER_H

#include <stdio.h>

class FileWriter
{
public:
    FileWriter(const char* pFileName);
    virtual ~FileWriter();
    void AddBit(int iBit);
private:
    FILE* m_pFile;
    unsigned char m_iBitSeq;
    unsigned char m_iBitSeqLen;
};

#endif
FileWriter.cpp:
#include "FileWriter.h"
#include <limits.h>
FileWriter::FileWriter(const char* pFileName)
{
    m_pFile = fopen(pFileName, "wb");
    m_iBitSeq = 0;
    m_iBitSeqLen = 0;
}

FileWriter::~FileWriter()
{
    while (m_iBitSeqLen > 0)
        AddBit(0);
    fclose(m_pFile);
}

void FileWriter::AddBit(int iBit)
{
    // Insert the new bit at the top; earlier bits shift toward bit 0.
    // (The original `m_iBitSeq |= iBit << CHAR_BIT;` followed by `>>= 1` loses the bit,
    //  because an unsigned char has no bit CHAR_BIT to hold it.)
    m_iBitSeq = (m_iBitSeq >> 1) | (iBit << (CHAR_BIT - 1));
    m_iBitSeqLen++;
    if (m_iBitSeqLen == CHAR_BIT)
    {
        fwrite(&m_iBitSeq, 1, 1, m_pFile);
        m_iBitSeqLen = 0;
    }
}
You can further improve it by accumulating the data up to a certain amount before writing it into the file.
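For example, a variation along these lines (a sketch; the buffer size and member names are my own, and the bit insertion follows the corrected AddBit above):
#include <stdio.h>
#include <stddef.h>
#include <limits.h>

class BufferedFileWriter
{
public:
    BufferedFileWriter(const char* pFileName)
        : m_pFile(fopen(pFileName, "wb")), m_iBitSeq(0), m_iBitSeqLen(0), m_iUsed(0) {}

    ~BufferedFileWriter()
    {
        while (m_iBitSeqLen > 0)
            AddBit(0);                         // pad the last partial byte
        Flush();
        fclose(m_pFile);
    }

    void AddBit(int iBit)
    {
        m_iBitSeq = (m_iBitSeq >> 1) | (iBit << (CHAR_BIT - 1));
        if (++m_iBitSeqLen == CHAR_BIT)
        {
            m_aBuffer[m_iUsed++] = m_iBitSeq;  // accumulate bytes in memory first
            m_iBitSeqLen = 0;
            if (m_iUsed == sizeof(m_aBuffer))
                Flush();
        }
    }

private:
    void Flush()
    {
        fwrite(m_aBuffer, 1, m_iUsed, m_pFile);
        m_iUsed = 0;
    }

    FILE* m_pFile;
    unsigned char m_aBuffer[65536];            // write in 64 KiB blocks
    unsigned char m_iBitSeq;
    unsigned char m_iBitSeqLen;
    size_t m_iUsed;
};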
I have a circular buffer which is backed with file mapped memory (the buffer is in the size range of 8GB-512GB).
I am writing to (8 instances of) this memory in a sequential manner from the beginning to the end at which point it loops around back to the beginning.
It works fine until it reaches the end where it needs to perform two file mappings and loop around the memory, at which point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.
using namespace boost::interprocess;

class mapping
{
public:
    mapping()
    {
    }

    mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
        : offset_(offset)
        , mode_(mode)
    {
        const auto aligned_size        = page_ceil(size + page_size());
        const auto aligned_file_size   = page_floor(file_size);
        const auto aligned_file_offset = page_floor(offset % aligned_file_size);
        const auto region1_size        = std::min(aligned_size, aligned_file_size - aligned_file_offset);
        const auto region2_size        = aligned_size - region1_size;

        if (region2_size)
        {
            const auto region1_address = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address();
            const auto region2_address = reinterpret_cast<char*>(region1_address) + region1_size;

            region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
            region2_ = mapped_region(file, mode, 0, region2_size, region2_address);
        }
        else
        {
            region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
            region2_ = mapped_region();
        }

        size_   = region1_.get_size() + region2_.get_size();
        offset_ = aligned_file_offset;
    }

    auto offset() const -> std::size_t { return offset_; }
    auto size() const -> std::size_t { return size_; }
    auto data() const -> const void* { return region1_.get_address(); }
    auto data() -> void* { return region1_.get_address(); }

    auto flush(bool async = true) -> void
    {
        region1_.flush(async);
        region2_.flush(async);
    }

    auto mode() const -> mode_t { return mode_; }

private:
    std::size_t offset_ = 0;
    std::size_t size_ = 0;
    mode_t mode_;
    mapped_region region1_;
    mapped_region region2_;
};

struct loop_mapping::impl final
{
    std::tr2::sys::path file_path_;
    file_mapping file_mapping_;
    std::size_t file_size_;
    std::size_t map_size_ = page_floor(256000000ULL);
    std::shared_ptr<mapping> mapping_ = std::shared_ptr<mapping>(new mapping());
    std::shared_ptr<mapping> prev_mapping_;
    bool write_;

public:
    impl(std::tr2::sys::path path, bool write)
        : file_path_(std::move(path))
        , file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
        , file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
        , write_(write)
    {
        REQUIRE(file_size_ >= map_size_ * 3);
    }

    ~impl()
    {
        prev_mapping_.reset();
        mapping_.reset();
    }

    auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
    {
        offset = offset % page_floor(file_size_);
        REQUIRE(size < file_size_ - map_size_ * 3);

        const auto write = write_opt.get_value_or(write_);
        REQUIRE(!write || write_);

        if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
        {
            auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));

            if (mapping_)
                mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
            if (prev_mapping_)
                prev_mapping_->flush(false);

            prev_mapping_ = std::move(mapping_);
            mapping_ = std::move(new_mapping);
        }

        return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
    }
};
// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
    auto src = get_new_data(5000000/8);
    auto dst = loop.data(n * 5000000/8, 5000000/8, true);
    std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
    std::this_thread::sleep_for(std::chrono::seconds(1));
}
Any ideas?
Target System:
1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 core, 2.6 GHz)
6x 8GB DDR3 1600 MHz ECC
Windows Server 2012
8 buffers each 8 to 512GiB in size on a system with 48GiB of physical memory means that your mapping will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault, and the page is read in. That doesn't happen on the first run, since merely a zero page is used. To make matters worse, reading in pages again competes with write-behind of dirty pages.
Now, there is unluckily no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (maybe when you're 3/4 through the buffer).
Windows Server 2012 (which you're using) supports PrefetchVirtualMemory which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).
That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
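A sketch of the call (available on Windows 8 / Server 2012 and later, declared via <windows.h>/<memoryapi.h>; the address and length of the region you are about to write are placeholders here):
WIN32_MEMORY_RANGE_ENTRY range;
range.VirtualAddress = next_region_base;   // start of the part of the mapping you will write next
range.NumberOfBytes  = next_region_size;

PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);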
Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.
Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.
Since your code is devoid of any comments, filled with auto variables, not compilable as is, and I don't have 512 GB available on my PC to test it anyway, this will remain a passing thought off the top of my head.
Each of your processes only writes a few hundred KB/s, so there should be ample time to flush that to disk in the background.
However, it seems you are asking the boost mapping system to flush either synchronously or asynchronously the previous chunk depending on your mysterious offset computations:
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.
What the operating system does at this point depends on the boost implementation, which is not described (or at least not in a way obvious enough for me to get after a cursory look at their man page).
If boost stuffed your 48 GB of memory with unflushed pages, you could certainly experience a sudden and prolonged deceleration.
At least worth a comment in your code if this mysterious line does something clever and completely different I missed entirely.
If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.
The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
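Roughly like this (a sketch; region_base/region_size stand for the part of the page-file-backed view you are about to reuse):
// MEM_RESET tells the memory manager the current contents are no longer needed, so dirty
// pages are not written out and are not read back in before you overwrite them.
// The protection argument is required but ignored when MEM_RESET is used.
VirtualAlloc(region_base, region_size, MEM_RESET, PAGE_NOACCESS);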
I am going to assume that by "loop around" you mean that the RAM got full.
What happens is that until the RAM gets full, all you have to do is allocate a page and write into it (RAM speed); after the RAM gets full, every page allocation turns into 2 actions:
1. you have to write the dirty page back (DISK speed)
2. and allocate a page (RAM speed)
And in the worst case you also have to bring the page in from the file on disk (DISK speed) if you are reading something from it.
So instead of working only at RAM speed (page allocation), every page allocation runs at DISK speed.
This doesn't happen with 2x8GB because that is small enough for all of the memory of both files to remain fully in RAM.
The problem here, it turns out, is that when you overwrite a valid page in memory, the page first has to be read from the drive before being overwritten. There is no way to get around this issue, as far as I know, when using memory-mapped files.
The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus do not need to be read back.
Let me first preface this with the fact that I know these kind of micro-optimisations are rarely cost-effective. I'm curious about how stuff works though. For all cacheline numbers etc, I am thinking in terms of an x86-64 i5 Intel CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, due to the fact that pulling in large amounts of data is done in a forward-facing manner - that is, if I read byte 128, then the cacheline (assuming 64 bytes in length) will read in bytes 128-191 inclusive. Consequently, if the next byte I wanted to access was at 129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter? Because cache line alignment will pick the starting point at the closest 64-divisible boundary, if I pick byte 127 to start with, I will load bytes 64-127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cache miss when transitioning from 128 to 127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
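The boundary arithmetic in question, assuming 64-byte lines (addr is just a placeholder for the byte address being touched):
#include <cstdint>

constexpr std::uintptr_t kLineSize = 64;
std::uintptr_t line_start = addr & ~(kLineSize - 1);   // byte 127 -> line 64..127, byte 128 -> line 128..191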
I am aware that the cachelines are read in as 8-byte chunks, and as such the full cacheline would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
Could somebody clear up if I'm right here, and old me is wrong? I've searched for a full day and still not been able to get a final answer on this.
tl;dr : Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past? (To 15 years back or so)
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <iostream>
#include <cstring>   // memset

// Size of dataset
#define SIZE_OF_ARRAY 1024*1024*256

// Are we walking forwards or backwards?
#define FORWARDS 1

int main()
{
    // Timer setup
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;

    int* intArray = new int[SIZE_OF_ARRAY];

    // Memset - shouldn't affect the test because my cache isn't that big!
    memset(intArray, 0, SIZE_OF_ARRAY * sizeof(int));

    // Arbitrary numbers for break points
    intArray[SIZE_OF_ARRAY - 1] = 55;
    intArray[0] = 15;

    int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];

    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    // Actual code
    if (FORWARDS)
    {
        while (true)
        {
            if (*(intArray++) == 55)
                break;
        }
    }
    else
    {
        while (true)
        {
            if (*(backwardsPtr--) == 15)
                break;
        }
    }

    // Cleanup
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;

    std::cout << ElapsedMicroseconds.QuadPart << std::endl;

    // So I can read the output
    char a;
    std::cin >> a;

    return 0;
}
I apologise for A) Windows code, and B) Hacky implementation. It's thrown together to test a hypothesis, but doesn't prove the reasoning.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experimentation shows, there is no difference. Unlike the interface between the processor and the L1 cache, the memory system transacts in full cachelines, not bytes. As #user657267 pointed out, processor-specific prefetchers exist. These might prefer forward over backward, but I highly doubt it. All modern prefetchers detect direction rather than assuming it. Furthermore, they detect stride as well. They involve incredibly complex logic, and something as easy as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
I would like some help optimizing the most computationally intensive function of my program.
Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.
The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.
First up is the basic variant. It checks to see if blocks_ is a proper subset of x.blocks_. (Not exactly equal.) These are bitmaps, aka bit vectors or bitsets.
//Check for self comparison
if (this == &x)
    return false;

//A subset is equal to or smaller.
if (no_bits_ > x.no_bits_)
    return false;

int i;
bool equal = false;

//Pointers should not change.
const unsigned int *tptr = blocks_;
const unsigned int *xptr = x.blocks_;

for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
    if ((*tptr & *xptr) != *tptr)
        return false;
    if (*tptr != *xptr)
        equal = true;
}
return equal;
Then comes the SSE variant, which alas does not perform according to my expectations. Both of these snippets should look for the same things.
//starting pointers.
const __m128i* start  = (__m128i*)&blocks_;
const __m128i* xstart = (__m128i*)&x.blocks_;

__m128i block;
__m128i xblock;

//Unsigned ints are 32 bits, meaning 4 can fit in a register.
for (i = 0; i < no_blocks_; i += 4) {
    block  = _mm_load_si128(start + i);
    xblock = _mm_load_si128(xstart + i);

    //Equivalent to (block & xblock) != block
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
        return false;

    //Equivalent to block != xblock
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
        equal = true;
}
return equal;
Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?
I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.
There are three possibilities I see here.
First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.
Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_ is identical between the two samples, then start + i is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as the first sample.
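In other words, the loads would need to index in ints, roughly like this (a sketch; it assumes blocks_ and x.blocks_ are unsigned int pointers, as in the scalar version, and that the data is 16-byte aligned):
block  = _mm_load_si128(reinterpret_cast<const __m128i*>(blocks_ + i));
xblock = _mm_load_si128(reinterpret_cast<const __m128i*>(x.blocks_ + i));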
Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.
Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));

        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));

        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}
If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.
It seems that your problem is memory-bandwidth bound:
Asymptotically, you need about 2 operations to process each pair of integers scanned from memory. There is not enough arithmetic complexity to take advantage of the extra arithmetic throughput of the CPU's SSE instructions. In fact, the CPU spends most of its time waiting for data transfers.
And using SSE instructions in your case adds an instruction overhead, and the resulting code is not well optimized by the compiler.
There are some alternative strategies to improve performance in bandwidth-bound problems:
Multi-threading: hide memory-access latency behind concurrent arithmetic operations, e.g. in a hyper-threading context.
Fine-tune the amount of data loaded at a time to improve memory bandwidth.
Improve pipeline continuity by adding extra independent operations in the loop (e.g. scan two different sets of data at each step of your "for" loop).
Keep more data in cache or in registers (some iterations of your code may need the same set of data many times).
During optimizing my connect four game engine I reached a point where further improvements only can be minimal because much of the CPU-time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried to replace the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like the following:
struct TableEntry
{                                        // Old New
    unsigned __int64 lock;               //   8   8
    enum { VALID, UBOUND, LBOUND } flag; //   4   4
    short score;                         //   4   2
    char move;                           //   4   1
    char height;                         //   4   1
                                         // -------
                                         //  24  16 Bytes
    TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better but I think Konrad Rudolph is right and the main reason why it's that slow are the cache misses.
You are making copies of your table entries; what about using TableEntry& as the type? For the default value at the bottom, a static default TableEntry will also do. I suppose that is where you lose much time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        // hopefully now less than 35% of CPU usage :-)
        const TableEntry& te = mTable[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker if they are contiguous because of cache hits, but it seems you are already doing that.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hashtable touch all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size you will thrash. This manifests as the exact problem you are encountering: That instruction which first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers
TableEntry* getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    TableEntry* max = &mTable[idx + BUCKETSIZE];
    for (TableEntry* te = &mTable[idx]; te < max; te++)
    {
        if (te->height == NOTSET || lock == te->lock)
            return te;
    }
    return &DEFAULT_TABLE_ENTRY; // address of a static default entry, as in the answer above
}