I would like to perform the following operation as quickly as possible
x / LSB(x)
where x is an integral value unknown at compile time and LSB(x) = x & -x.
(Alternatively, the operation is equivalent to an exact division by the highest power of 2 that divides x.) I am looking for a reasonably portable solution (without compiler intrinsics/builtins such as GCC's __builtin_clz or the like).
My concern is that the following simple implementation
x / (x & -x)
would still result in an expensive division, as the compiler might fail to realize that the division is in fact equivalent to a right shift by the number of trailing zeros in the divisor.
If my concerns are reasonable, what would be a more efficient way to implement it?
I would appreciate a solution that extends easily to integral types of 32, 64, 128, ... bits.
How about
x >>= ffs(x)-1;
The ffs function conforms to 4.3BSD, POSIX.1-2001.
It won't work if x is 0: ffs(0) returns 0, so the shift count would be -1, which is undefined behavior.
If you don't want to rely on hardware bit-scan instructions, you can count trailing zeros as described in this answer. It's very fast, using a look-up table and a multiplication by a magic number. I'll re-post the code here:
unsigned x; // input: value whose trailing zeros are counted
unsigned c; // output: number of trailing zeros in x
static const unsigned MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
c = MultiplyDeBruijnBitPosition[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
Once you have counted the trailing zeros, you no longer need a division instruction. Instead, you can just shift the value right by c. That is (eliminating an unneeded temporary value), the code becomes this:
static const unsigned MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
x >>= MultiplyDeBruijnBitPosition[((unsigned)((x & -x) * 0x077CB531U)) >> 27]; // x /= LSB(x)
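For widths other than 32 bits, one width-generic alternative is a plain binary search over the trailing zero count. This is a different technique from the de Bruijn lookup above, shown only as a sketch: the name strip_trailing_zeros is made up for illustration, and returning 0 for an input of 0 is an assumed convention, since the division is undefined there.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: computes x / (x & -x) for any unsigned integer type
// by halving the candidate shift (a binary search on the trailing zeros).
// Assumes 8-bit bytes. Returns 0 for x == 0 (an assumed convention).
template <typename T>
T strip_trailing_zeros(T x) {
    static_assert(std::is_unsigned<T>::value, "use an unsigned type");
    if (x == 0) return 0;
    for (unsigned shift = sizeof(T) * 4; shift > 0; shift /= 2) {
        const T low_mask = static_cast<T>((T(1) << shift) - 1);
        if ((x & low_mask) == 0) x >>= shift;  // low `shift` bits are all zero
    }
    return x;
}
```

For example, strip_trailing_zeros<std::uint32_t>(40u) yields 5, because 40 = 5 * 8. The loop runs log2(width) iterations, so it is likely slower than the table lookup, but it needs no per-width constants.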
I'm reading an ESRI Shapefile, and to my dismay it uses big endian and little endian at different points (see, for instance, the table at page 4, plus the tables from page 5 to 8).
So I created two functions in C++, one for each endianness.
uint32_t readBig(ifstream& f) {
uint32_t num;
uint8_t buf[4];
f.read((char*)buf,4);
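num = uint32_t(buf[3]) | uint32_t(buf[2])<<8 | uint32_t(buf[1])<<16 | uint32_t(buf[0])<<24; // casts avoid shifting into the sign bit of a promoted int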
return num;
}
uint32_t readLittle(ifstream& f) {
uint32_t num;
f.read(reinterpret_cast<char *>(&num),4);
//f.read((char*)&num,4);
return num;
}
But I'm not sure this is the most efficient way to do it. Can this code be improved? Keep in mind it will run thousands, maybe millions of times for a single shapefile, so having even one of the functions call the other seems worse than having two separate functions. Is there a difference in performance between using reinterpret_cast and an explicit type conversion (char*)? Should I use the same in both functions?
Casting between pointer types does not affect performance -- In
this case, it's just a technicality to make the compiler happy.
If you're really making a separate call to read for every 32-bit
value, the time taken by the byte-swapping operation will likely be
in the noise. For speed, you probably should have your own
buffering layer so that your inner loop doesn't make any function
calls.
It's nice if the swap compiles down to a single opcode (like bswap), but whether or not that
is possible, or the fastest option, is processor-specific.
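As a rough sketch of such a buffering layer: read a large chunk of the file into memory once, then decode the fields from that buffer with shift-based loads that do not depend on the host's endianness. The helper names load_be32 and load_le32 are made up for illustration. Compilers commonly reduce these patterns to a plain load plus, where needed, a byte swap, but that is not guaranteed.

```cpp
#include <cstdint>

// Decode a big-endian 32-bit value from four consecutive bytes.
inline std::uint32_t load_be32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Decode a little-endian 32-bit value from four consecutive bytes.
inline std::uint32_t load_le32(const unsigned char* p) {
    return (std::uint32_t(p[3]) << 24) | (std::uint32_t(p[2]) << 16) |
           (std::uint32_t(p[1]) << 8)  |  std::uint32_t(p[0]);
}
```

With the file contents held in a std::vector<unsigned char> buf (filled by a single read call), a field at byte offset off is then load_be32(buf.data() + off) or load_le32(buf.data() + off), with no stream call per value.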
If you're really interested in maximizing speed, consider using SIMD intrinsics.
In most cases the compiler should generate a bswap instruction, which is probably sufficient. If however you need something faster than that, vpshufb is your friend...
#include <immintrin.h>
#include <cstdint>
// swap byte order in 16 x int16
inline void swap_16xi16(uint16_t input[16])
{
constexpr uint8_t mask_data[] = {
1, 0,
3, 2,
5, 4,
7, 6,
9, 8,
11, 10,
13, 12,
15, 14,
1, 0,
3, 2,
5, 4,
7, 6,
9, 8,
11, 10,
13, 12,
15, 14
};
const __m256i swapped = _mm256_shuffle_epi8(
_mm256_loadu_si256((const __m256i*)input),
_mm256_loadu_si256((const __m256i*)mask_data)
);
_mm256_storeu_si256((__m256i*)input, swapped);
}
// swap byte order in 8 x int32
inline void swap_8xi32(uint32_t input[8])
{
constexpr uint8_t mask_data[] = {
3, 2, 1, 0,
7, 6, 5, 4,
11, 10, 9, 8,
15, 14, 13, 12,
3, 2, 1, 0,
7, 6, 5, 4,
11, 10, 9, 8,
15, 14, 13, 12
};
const __m256i swapped = _mm256_shuffle_epi8(
_mm256_loadu_si256((const __m256i*)input),
_mm256_loadu_si256((const __m256i*)mask_data)
);
_mm256_storeu_si256((__m256i*)input, swapped);
}
// swap byte order in 4 x int64
inline void swap_4xi64(uint64_t input[4])
{
constexpr uint8_t mask_data[] = {
7, 6, 5, 4, 3, 2, 1, 0,
15, 14, 13, 12, 11, 10, 9, 8,
7, 6, 5, 4, 3, 2, 1, 0,
15, 14, 13, 12, 11, 10, 9, 8
};
const __m256i swapped = _mm256_shuffle_epi8(
_mm256_loadu_si256((const __m256i*)input),
_mm256_loadu_si256((const __m256i*)mask_data)
);
_mm256_storeu_si256((__m256i*)input, swapped);
}
inline void swap_16xi16(int16_t input[16])
{ swap_16xi16((uint16_t*)input); }
inline void swap_8xi32(int32_t input[8])
{ swap_8xi32((uint32_t*)input); }
inline void swap_4xi64(int64_t input[4])
{ swap_4xi64((uint64_t*)input); }
inline void swap_8f(float input[8])
{ swap_8xi32((uint32_t*)input); }
inline void swap_4d(double input[4])
{ swap_4xi64((uint64_t*)input); }
Is there a (fast) way to reverse the bits of each 32-bit int value within an AVX2 register?
E.g.
_mm256_set1_epi32(2732370386);
<do something here>
//binary: 10100010110111001010100111010010 => 1001011100101010011101101000101
//register contains 1268071237 which is decimal representation of 1001011100101010011101101000101
Since I can't find a suitable dupe, I'll just post it.
The main idea here is to make use of pshufb's dual use as a parallel 16-entry table lookup to reverse the bits of each nibble. Reversing bytes is obvious. Reversing the order of the two nibbles in every byte could be done by building it into the lookup tables (saves a shift) or by explicitly shifting the low nibble up (saves a LUT).
Something like this in total, not tested:
__m256i rbit32(__m256i x) {
__m256i shufbytes = _mm256_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
__m256i luthigh = _mm256_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
__m256i lutlow = _mm256_slli_epi16(luthigh, 4);
__m256i lowmask = _mm256_set1_epi8(15);
__m256i rbytes = _mm256_shuffle_epi8(x, shufbytes);
__m256i high = _mm256_shuffle_epi8(lutlow, _mm256_and_si256(rbytes, lowmask));
__m256i low = _mm256_shuffle_epi8(luthigh, _mm256_and_si256(_mm256_srli_epi16(rbytes, 4), lowmask));
return _mm256_or_si256(low, high);
}
In a typical context in a loop, those loads should be lifted out.
Curiously Clang uses 4 shuffles, it's duplicating the first shuffle.
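Since the AVX2 version is untested, a scalar reference can help when checking it lane by lane. The following is a minimal sketch that uses the same 16-entry nibble-reversal table; the name rbit32_scalar is made up for illustration.

```cpp
#include <cstdint>

// Scalar reference: reverse the bit order of one 32-bit value, one nibble at
// a time, using the same nibble-reversal table as the luthigh constant above.
inline std::uint32_t rbit32_scalar(std::uint32_t x) {
    static const std::uint8_t rev4[16] = {0, 8, 4, 12, 2, 10, 6, 14,
                                          1, 9, 5, 13, 3, 11, 7, 15};
    std::uint32_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 4) | rev4[x & 15];  // lowest nibble of x ends up highest in r
        x >>= 4;
    }
    return r;
}
```

For the value from the question, rbit32_scalar(2732370386u) returns 1268071237, so each 32-bit lane of rbit32(_mm256_set1_epi32(...)) can be compared against it.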
I'm discovering Halide and got some success with a pipeline doing various
transformations. Most of these are based on the examples within the sources (color-transformations, various filters, hist-eq).
My next step needs to process the image in blocks. In a more general form,
partially-overlapping blocks.
Examples
Input:
[ 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32]
Non-overlapping blocks:
Size: 2x4
[ 1, 2, 3, 4,
9, 10, 11, 12]
[ 5, 6, 7, 8,
13, 14, 15, 16]
[ 17, 18, 19, 20,
25, 26, 27, 28]
[ 21, 22, 23, 24,
29, 30, 31, 32]
Overlapping blocks:
Size: 2x4 with 50% overlap (both axes)
[ 1, 2, 3, 4,
9, 10, 11, 12]
[ 3, 4, 5, 6,
11, 12, 13, 14]
[ 5, 6, 7, 8,
13, 14, 15, 16]
-
[ 9, 10, 11, 12,
17, 18, 19, 20]
[11, 12, 13, 14,
19, 20, 21, 22]
...
I suspect there should be a nice way to express these, as those are also quite common
in many algorithms (e.g. macroblocks).
What I checked out
I tried to gather ideas from the tutorial and example apps and found the following,
which seem somewhat connected to what I want to implement:
Halide tutorial lesson 6: Realizing Funcs over arbitrary domains
// We start by creating an image that represents that rectangle
Image<int> shifted(5, 7); // In the constructor we tell it the size
shifted.set_min(100, 50); // Then we tell it the top-left corner
The problem I have: how to generalize this to multiple shifted domains without looping?
Halide tutorial lesson 9: Multi-pass Funcs, update definitions, and reductions
Here RDom is introduced which looks nice to create a block-view
Most examples using RDom seem to be sliding-window like approaches where there are no jumps
Target
So in general I'm asking how to implement a block-based view which can then be processed by
other steps.
It would be nice if the approach were general enough to handle both overlapping and non-overlapping blocks
Somehow generating the top-left indices first?
In my case, the image dimensions are known at compile time, which simplifies this
But I would still like some compact form which is nice to work with from Halide's perspective (no hand-coded stuff like those examples with small filter boxes)
The approach used might depend on the output per block, which is a scalar in my case
Maybe someone can give me some ideas and/or some examples (which would be very helpful).
I'm sorry for not providing code, as I don't think I could produce anything helpful.
Edit: Solution
After dsharlet's answer and some tiny debugging/discussion here, the following very simplified self-contained code works (assuming a 1-channel 64x128 input like this one I created).
#include "Halide.h"
#include "Halide/tools/halide_image_io.h"
#include <iostream>
int main(int argc, char **argv) {
Halide::Buffer<uint8_t> input = Halide::Tools::load_image("TestImages/block_example.png");
// This is a simple example assuming an input of 64x128
std::cout << "dim 0: " << input.width() << std::endl;
std::cout << "dim 1: " << input.height() << std::endl;
// The "outer" (block) and "inner" (pixel) indices that describe a pixel in a tile.
Halide::Var xo, yo, xi, yi, x, y;
// The distance between the start of each tile in the input.
int tile_stride_x = 32;
int tile_stride_y = 64;
int tile_size_x = 32;
int tile_size_y = 64;
Halide::Func tiled_f;
tiled_f(xi, yi, xo, yo) = input(xo * tile_stride_x + xi, yo * tile_stride_y + yi);
Halide::RDom tile_dom(0, tile_size_x, 0, tile_size_y);
Halide::Func tile_means;
tile_means(xo, yo) = sum(Halide::cast<uint32_t>(tiled_f(tile_dom.x, tile_dom.y, xo, yo))) / (tile_size_x * tile_size_y);
Halide::Func output;
output(xo, yo) = Halide::cast<uint8_t>(tile_means(xo, yo));
Halide::Buffer<uint8_t> output_(2, 2);
output.realize(output_);
Halide::Tools::save_image(output_, "block_based_stuff.png");
}
Here's an example that breaks a Func into blocks of arbitrary stride and size:
Func f = ... // The thing being blocked
// The "outer" (block) and "inner" (pixel) indices that describe a pixel in a tile.
Var xo, yo, xi, yi;
// The distance between the start of each tile in the input.
int tile_stride_x, tile_stride_y;
Func tiled_f;
tiled_f(xi, yi, xo, yo) = f(xo * tile_stride_x + xi, yo * tile_stride_y + yi);
Func tiled_output;
tiled_output(xi, yi, xo, yo) = ... // Your tiled processing here
To compute some reduction (like statistics) on each block, you can do the following:
RDom tile_dom(0, tile_size_x, 0, tile_size_y);
Func tile_means;
tile_means(xo, yo) = sum(tiled_output(tile_dom.x, tile_dom.y, xo, yo)) / (tile_size_x * tile_size_y);
To flatten the tiles back into a result is a bit tricky. It probably depends on your method of combining the results in overlapped areas. If you want to add up the overlapping tiles, the simplest way is probably to use an RDom:
RDom tiles_dom(
0, tile_size_x,
0, tile_size_y,
min_tile_xo, extent_tile_xo,
min_tile_yo, extent_tile_yo);
Func output;
Expr output_x = tiles_dom[2] * tile_stride_x + tiles_dom[0];
Expr output_y = tiles_dom[3] * tile_stride_y + tiles_dom[1];
output(x, y) = 0;
output(output_x, output_y) += tiled_output(tiles_dom[0], tiles_dom[1], tiles_dom[2], tiles_dom[3]);
Note that in the above two blocks of code, tile_stride_x and tile_size_x are independent parameters, allowing for any tile size and overlap.
In both of your examples, tile_size_x = 4, and tile_size_y = 2. To get non-overlapping tiles, set the tile strides equal to the tile size. To get 50% overlapping tiles, set tile_stride_x = 2, and tile_stride_y = 1.
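To see how the size and stride settings select tiles, the index mapping of tiled_f can be mimicked in plain C++ (not Halide). This is only an illustration of the arithmetic on the 4x8 example from the question; the helper name extract_tile is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (plain C++, not Halide): copies tile (xo, yo) out of a
// row-major image using the same index mapping as tiled_f above:
//   x = xo * tile_stride_x + xi,  y = yo * tile_stride_y + yi
std::vector<int> extract_tile(const std::vector<int>& image, int width,
                              int tile_size_x, int tile_size_y,
                              int tile_stride_x, int tile_stride_y,
                              int xo, int yo) {
    std::vector<int> tile;
    tile.reserve(static_cast<std::size_t>(tile_size_x) * tile_size_y);
    for (int yi = 0; yi < tile_size_y; ++yi) {
        for (int xi = 0; xi < tile_size_x; ++xi) {
            const int x = xo * tile_stride_x + xi;
            const int y = yo * tile_stride_y + yi;
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            assert(x < width && idx < image.size());  // tile must fit the image
            tile.push_back(image[idx]);
        }
    }
    return tile;
}
```

With the image holding 1 to 32 in rows of 8, a size of 4x2 and strides of 4 and 2 make tile (1, 0) equal to 5, 6, 7, 8, 13, 14, 15, 16. With strides of 2 and 1, tile (1, 0) becomes 3, 4, 5, 6, 11, 12, 13, 14, matching the overlapping example.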
A useful schedule for an algorithm like this is:
// Compute tiles as needed by the output.
tiled_output.compute_at(output, tiles_dom[2]); // tiles_dom[2] is the tile x index (xo) of the 4-D RDom
// or
tiled_output.compute_at(tile_means, xo);
There are other options, like using a pure func (no update/RDom) that uses the mod operator to figure out tile inner and outer indices. However, this approach can be difficult to schedule efficiently with overlapping tiles (depending on the processing you do at each tile). I use the RDom approach when this problem comes up.
Note that with the RDom approach, you have to supply the bounds of the tile indices you want computed (min_tile_xo, extent_tile_xo, ...), which can be tricky for overlapped tiles.
I have just started my journey with Wolfram Mathematica and I want to implement a simple genetic algorithm. The construction of the data is given and I have to start with such rows/columns.
Here is what I have:
chromosome := RandomSample[CharacterRange["A", "G"], 7]
chromosomeList = Table[chromosome, 7] // MatrixForm
This gives me a matrix, where every row represents a chromosome:
yPos = Flatten[Position[chromosomeList, #], 1] & /@ {"A", "B", "C",
"D", "E", "F", "G"};
yPos = yPos[[All, 3 ;; 21 ;; 3]] // Transpose
Now every column represents a letter (from A to G) and every row its index in every chromosome:
Here is a given efficiency matrix, where every row represents a different letter (from A to G) and every column gives the value that should be applied at that particular position:
efficiencyMatrix = {
{34, 31, 20, 27, 24, 24, 18},
{14, 14, 22, 34, 26, 19, 22},
{22, 16, 21, 27, 35, 25, 30},
{17, 21, 24, 16, 31, 22, 20},
{17, 29, 22, 31, 18, 19, 26},
{26, 29, 37, 34, 37, 20, 21},
{30, 28, 37, 28, 29, 23, 19}}
What I want to do is to create a matrix with values that correspond to the letter and its position. I have done it like this:
values = Transpose[{ efficiencyMatrix[[1, yPos[[1]]]],
efficiencyMatrix[[2, yPos[[2]]]],
efficiencyMatrix[[3, yPos[[3]]]],
efficiencyMatrix[[4, yPos[[4]]]],
efficiencyMatrix[[5, yPos[[5]]]],
efficiencyMatrix[[6, yPos[[6]]]],
efficiencyMatrix[[7, yPos[[7]]]]}]
How can I write it in a more elegant way?
You can apply a list of functions to some variable using the function Through, which is helpful when applying Position multiple times. Because Position[patt][expr] == Position[expr, patt], we can do
Through[ (Position /@ CharacterRange["A","C"])[{"B", "C", "A"}] ]
to get {{{3}}, {{1}}, {{2}}}, i.e. the positions 3, 1, 2.
Position can also operate on lists, so we can simplify finding ypos by doing
Transpose@Map[Last, Through[(Position /@ characters)[chromosomeList]], {2}]
where characters is the relevant output of CharacterRange.
We can also simplify dealing with ranges of integers by mapping over the Range function, so in total we end up with
characters = CharacterRange["A","G"]
efficiencies = ...
chromosomes = ...
ypos = Transpose@Map[Last, Through[(Position /@ characters)[chromosomes]], {2}];
Transpose[efficiencies[[#, ypos[[#]]]] & /@ Range[Length[characters]]]
If I know an integer k = 2^n, how can I efficiently find n?
In other words, if I know that a single bit in an integer is set, how can I get the location of that bit?
One idea is to find the Hamming weight of k-1, but are there any simpler ways I'm not thinking of?
Bit Twiddling Hacks has a lot of amazing (and obscure, performance-oriented) bit hacks. The best one for your use seems to be the one using a multiply and a lookup.
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
This page provides a detailed analysis of the problem with a focus on chess programming.
Answer copied from here.
PS: Didn't know how to give credit to the author on that question, and so wrote it like this.
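For the specific case in the question, where k is already known to be a power of two, the v & -v isolation step can be skipped. A minimal sketch wrapping the table above in a function (the name log2_pow2 is made up for illustration) could look like this:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: returns n for k == 2^n. Precondition: exactly one bit of k is set.
inline int log2_pow2(std::uint32_t k) {
    static const int MultiplyDeBruijnBitPosition[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    assert(k != 0 && (k & (k - 1)) == 0);  // k must be a power of two
    // k is a single bit, so the multiply is just a left shift of the constant;
    // the top 5 bits of the product index the table.
    return MultiplyDeBruijnBitPosition[
        static_cast<std::uint32_t>(k * 0x077CB531U) >> 27];
}
```

For example, log2_pow2(1u << 16) returns 16.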