I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance, for N=4:
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously:
1. transpose the matrix M into a matrix M'
2. popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one?
Note that N could be quite large (say 1024 or more) relative to the 64-bit word size of the architecture.
Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than @edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it was optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, at about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I haven't profiled yet, but a front-end decode bottleneck would explain it). For some reason @edrezen's version only runs at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running at 3.9GHz while Agner Fog's library detects it at 4.0, but that's not enough to explain the difference.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
It isn't clear what your data layout looks like when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory. i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 will be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Using accumulators narrower than 1 byte within a SIMD vector is basically a https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because there are no SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration:
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.
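To make the first widening step concrete, here is a minimal scalar sketch of it, assuming one uint32_t per row with bit c holding column c. It only goes from 1-bit inputs to 2-bit accumulators (3 rows per group) and then flushes straight to full-width counters; the function name column_counts_swar is illustrative and does not come from the Godbolt code, which does the same thing on 256-bit vectors and keeps widening to 4, 8 and 16 bits before flushing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar illustration of the 1-bit -> 2-bit SWAR step (hypothetical helper).
// rows[i] holds 32 columns of row i; counts[c] receives the number of rows
// whose bit c is set.
void column_counts_swar(const uint32_t* rows, std::size_t nbRows, uint32_t counts[32])
{
    assert(rows != nullptr || nbRows == 0);
    const uint32_t mask = 0x55555555u;      // low bit of every 2-bit field
    for (int c = 0; c < 32; ++c) counts[c] = 0;

    for (std::size_t i = 0; i < nbRows; i += 3) {
        uint32_t even = 0, odd = 0;         // 16 two-bit accumulators each
        // At most 3 rows per group, so every 2-bit field stays <= 3 and no
        // carry ever crosses into the neighbouring field.
        for (std::size_t r = i; r < i + 3 && r < nbRows; ++r) {
            even += rows[r] & mask;         // columns 0, 2, 4, ...
            odd  += (rows[r] >> 1) & mask;  // columns 1, 3, 5, ...
        }
        // Flush the 2-bit fields into the full-width per-column counters.
        for (int f = 0; f < 16; ++f) {
            counts[2 * f]     += (even >> (2 * f)) & 3u;
            counts[2 * f + 1] += (odd  >> (2 * f)) & 3u;
        }
    }
}
```

With the 4x4 example from the question packed as rows {11, 10, 8, 9} (column 0 in bit 0), the first four counters come out as 2, 2, 0, 4. The flush loop is the expensive part, which is why the vector version keeps widening the accumulators instead of flushing after every 3 rows.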
I benchmarked the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a nbRowsx32 matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
// Naive approach (matrix transposition)
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
for (size_t row=0; row<32; row++)
// We popcount the current row
uint8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
// Naive approach (row by row)
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
for (uint64_t row=0; row<nbRows; row++)
for (size_t col=0; col<32; col++)
globalSums[col] += (bitmap[row] >> col) & 1;
// AVX2 (matrix transposition + popcount)
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
// We transpose the input as a 32x32 bit matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
globalSums[i] += __builtin_popcount (transpo[i]);
// AVX2 approach (update totals row by row)
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
// Base case: ends the recursion
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k) { }
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
// we reset the local totals
localSums = _mm256_setzero_si256();
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
UpdateLocalSums<1>(localSums, bitmap, k);
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
    name,
    timeTotal / nbRuns,
    deviation (nbRuns, timeTotal2, timeTotal),
    cycleTotal / nbRuns,
    deviation (nbRuns, cycleTotal2, cycleTotal),
    nbRows * cycleTotal / timeTotal / 1000000.0,
    check
);
int main(int argc, char **argv)
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build a bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
fprintf(stderr, "unable to allocate the bitmap\n");
// We fill the bitmap with random values
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
__m256i expand_bits_to_bytes(uint32_t x)
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
void sse_trans(char const *inp, char *out)
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no further effort to optimize the AVX2 "transpose+popcount" method, which should be feasible by computing several popcounts in parallel (I may test it later).
I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
// Recursive SWAR approach (with template specialization)
template<int DEPTH>
struct RecursiveSWAR
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
_mm256_and_si256 (
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
_mm256_and_si256 (
_mm256_srli_epi32 (
// Template specialization for DEPTH=0
struct RecursiveSWAR<0>
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
unsigned char reversebits (unsigned char b)
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
    name,
    timeTotal / nbRuns,
    deviation (nbRuns, timeTotal2, timeTotal),
    cycleTotal / nbRuns,
    deviation (nbRuns, cycleTotal2, cycleTotal),
    nbRows * cycleTotal / timeTotal / 1000000.0,
    check
);
int main(int argc, char **argv)
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build a bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
fprintf(stderr, "unable to allocate the bitmap\n");
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
The size of the accumulators is 2^DEPTH bits in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU cache properties can explain this behaviour.
I am using Superpowered for various real-time FX, and they are all very straightforward to use. Pitch shifting, however, is a whole other story. I think that's because it is built on the time-stretching algorithm, which has to deal with output that changes in time and is therefore a lot more complex than applying FX like EQ or reverb. However, I'm only interested in changing the pitch of my mic input.
I looked at the only example I could find on GitHub and I slightly adapted it to fit my work:
static bool audioProcessing(void *clientdata,
float **buffers,
unsigned int inputChannels,
unsigned int outputChannels,
unsigned int numberOfSamples,
unsigned int samplerate,
uint64_t hostTime) {
__unsafe_unretained Superpowered *self = (__bridge Superpowered *)clientdata;
SuperpoweredAudiobufferlistElement inputBuffer;
inputBuffer.startSample = 0;
inputBuffer.samplesUsed = 0;
inputBuffer.endSample = self->timeStretcher->numberOfInputSamplesNeeded;
inputBuffer.buffers[0] = SuperpoweredAudiobufferPool::getBuffer(self->timeStretcher->numberOfInputSamplesNeeded * 8 + 64);
inputBuffer.buffers[1] = inputBuffer.buffers[2] = inputBuffer.buffers[3] = NULL;
self->timeStretcher->process(&inputBuffer, self->outputBuffers);
int samples = self->timeStretcher->numberOfInputSamplesNeeded;
float *timeStretchedAudio = (float *)self->outputBuffers->nextSliceItem(&samples);
if (timeStretchedAudio != 0) {
    SuperpoweredDeInterleave(timeStretchedAudio, buffers[0], buffers[1], numberOfSamples);
}
return true;
}
I have removed most of the code that I thought wasn't necessary. For example, there was a while loop that seemed to deal with time-stretch scenarios; I'm just outputting the same duration as I input.
Some observations:
If I don't clear the outputBuffers my memory usage goes through the roof
If I use self->outputBuffers->rewindSlice(); the app becomes silent, probably meaning the buffers are getting overwritten with silence
If I do not use self->outputBuffers->rewindSlice(); I can hear my own voice coming back, but timeStretchedAudio is always 0 except the very first time
I finally got it working:
static bool audioProcessing(void *clientdata,
float **buffers,
unsigned int inputChannels,
unsigned int outputChannels,
unsigned int numberOfSamples,
unsigned int samplerate,
uint64_t hostTime) {
__unsafe_unretained Superpowered *self = (__bridge Superpowered *)clientdata;
//timeStretching->setRateAndPitchShift(realTimeRate, realTimePitch);
SuperpoweredAudiobufferlistElement inputBuffer;
inputBuffer.startSample = 0;
inputBuffer.samplesUsed = 0;
inputBuffer.endSample = numberOfSamples;
inputBuffer.buffers[0] = SuperpoweredAudiobufferPool::getBuffer((unsigned int) (numberOfSamples * 8 + 64));
inputBuffer.buffers[1] = inputBuffer.buffers[2] = inputBuffer.buffers[3] = NULL;
// Converting the 16-bit integer samples to 32-bit floating point.
SuperpoweredInterleave(buffers[0], buffers[1], (float *)inputBuffer.buffers[0], numberOfSamples);
//SuperpoweredShortIntToFloat(audioInputOutput, (float *)inputBuffer.buffers[0], (unsigned int) numberOfSamples);
self->timeStretcher->process(&inputBuffer, self->outputBuffers);
// Do we have some output?
if (self->outputBuffers->makeSlice(0, self->outputBuffers->sampleLength)) {
while (true) { // Iterate on every output slice.
// Get pointer to the output samples.
int numSamples = 0;
float *timeStretchedAudio = (float *)self->outputBuffers->nextSliceItem(&numSamples);
if (!timeStretchedAudio || *timeStretchedAudio == 0) {
// Convert the time stretched PCM samples from 32-bit floating point to 16-bit integer.
//SuperpoweredFloatToShortInt(timeStretchedAudio, audioInputOutput,
// (unsigned int) numSamples);
SuperpoweredDeInterleave(timeStretchedAudio, buffers[0], buffers[1], numSamples);
self->recorder->process(timeStretchedAudio, numSamples);
// Write the audio to disk.
//fwrite(audioInputOutput, 1, numSamples * 4, fd);
// Clear the output buffer list.
// If we have enough samples in the fifo output buffer, pass them to the audio output.
//SuperpoweredFloatToShortInt((float *)inputBuffer.buffers[0], audioInputOutput, (unsigned int) numberOfSamples);
return true;
I am not sure if changing the rate also works, but I don't care for this application. YMMV.
Implement the part marked with TODO. That's the point where you need to provide input for the timeStretcher. Also take care of separating the output from the input. Output could be written before the input is consumed.
I have a simple program that creates a single cycle sine wave and puts the float numbers to a buffer. Then this is exported to a text file.
But I want to be able to export it to a WAV file (24 bit). Is there a simple way of doing it like on the text file?
Here is the code I have so far:
#include <iostream>
#include <fstream>
#include <cmath>
using namespace std;
int main ()
{
    long double pi = 3.14159265359; // Declaration of PI
    ofstream textfile; // Text object
    textfile.open("sine.txt"); // Creating the txt
    double samplerate = 44100.00; // Sample rate
    double frequency = 200.00; // Frequency
    int bufferSize = (1/frequency)*samplerate; // Buffer size
    double buffer[bufferSize]; // Buffer
    for (int i = 0; i <= (1/frequency)*samplerate; ++i) // Single cycle
    {
        buffer[i] = sin(frequency * (2 * pi) * i / samplerate); // Putting into buffer the float values
        textfile << buffer[i] << endl; // Exporting to txt
    }
    textfile.close(); // Closing the txt
    return 0; // Success
}
First you need to open the stream for binary.
ofstream stream;
stream.open("sine.wav", ios::out | ios::binary);
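Next you'll need to write out a wave header. You can search to find the details of the wave file format. The important bits are the sample rate, bit depth, and length of the data. (write16 and write32 below are assumed to be small helpers that write a 16-bit or 32-bit little-endian integer to the stream; they are not standard functions.)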
int bufferSize = (1/frequency)*samplerate;
stream.write("RIFF", 4); // RIFF chunk
write32(stream, 36 + bufferSize*sizeof(int)); // RIFF chunk size in bytes
stream.write("WAVE", 4); // WAVE chunk
stream.write("fmt ", 4); // fmt chunk
write32(stream, 16); // size of fmt chunk
write16(stream, 1); // Format = PCM
write16(stream, 1); // # of Channels
write32(stream, samplerate); // Sample Rate
write32(stream, samplerate*sizeof(int)); // Byte rate
write16(stream, sizeof(int)); // Frame size
write16(stream, 24); // Bits per sample
stream.write("data", 4); // data chunk
write32(stream, bufferSize*sizeof(int)); // data chunk size in bytes
Now that the header is out of the way, you'll just need to modify your loop to first convert the double (-1.0,1.0) samples into 32-bit signed int. Truncate the bottom 8-bits since you only want 24-bit and then write out the data. Just so you know, it is common practice to store 24-bit samples inside of a 32-bit word because it is much easier to stride through using native types.
for (int i = 0; i < bufferSize; ++i) // Single cycle
{
    double tmp = sin(frequency * (2 * pi) * i / samplerate);
    int intVal = (int)(tmp * 2147483647.0) & 0xffffff00;
    stream.write((const char*)&intVal, sizeof(intVal)); // raw bytes; operator<< would write decimal text
}
A couple other things:
1) I don't know how you weren't overflowing buffer by using the <= in your loop. I changed it to a <.
2) Again regarding the buffer size. I'm not sure if you are aware, but you can't represent a repeated waveform by a single cycle for all frequencies. What I mean is that for most frequencies, if you use this code and expect to play the waveform repeated, you're going to hear a glitch on every cycle. It only works when the sample rate is an exact multiple of the frequency, so that the cycle comes around to exactly the same phase: 1 kHz at a 48 kHz sample rate gives exactly 48 samples per cycle, and 441 Hz at your 44.1 kHz gives exactly 100. 999.9 Hz will be a different story though, and so will your 200 Hz at 44.1 kHz (220.5 samples per cycle).
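For comparison, here is a minimal sketch of a writer that packs each sample into 3 bytes, which is the layout a plain PCM header with 24 bits per sample describes (block align = channels * 3). The helper names append_le, make_sine_cycle and encode_wav24_mono are made up for this example; the sketch assumes mono input in the range [-1.0, 1.0] and builds the whole file in memory so it does not depend on the byte order of the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

// Append `bytes` bytes of `value`, least significant first (WAV is little-endian).
static void append_le(std::string& out, uint32_t value, int bytes)
{
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
}

// One cycle of a sine wave, truncated to a whole number of samples.
std::vector<double> make_sine_cycle(double frequency, double sampleRate)
{
    const double pi = 3.14159265358979323846;
    const int n = static_cast<int>(sampleRate / frequency);
    std::vector<double> buf(n);
    for (int i = 0; i < n; ++i)
        buf[i] = std::sin(2.0 * pi * frequency * i / sampleRate);
    return buf;
}

// Hypothetical helper: encode mono samples as a packed 24-bit PCM WAV file image.
std::string encode_wav24_mono(const std::vector<double>& samples, uint32_t sampleRate)
{
    assert(sampleRate > 0);
    const uint32_t bytesPerSample = 3;
    const uint32_t dataSize = static_cast<uint32_t>(samples.size()) * bytesPerSample;
    const uint32_t pad = dataSize & 1;             // RIFF chunks are padded to even size

    std::string out;
    out += "RIFF";  append_le(out, 36 + dataSize + pad, 4);
    out += "WAVE";
    out += "fmt ";  append_le(out, 16, 4);         // size of fmt chunk
    append_le(out, 1, 2);                          // format = PCM
    append_le(out, 1, 2);                          // channels
    append_le(out, sampleRate, 4);                 // sample rate
    append_le(out, sampleRate * bytesPerSample, 4);// byte rate
    append_le(out, bytesPerSample, 2);             // frame size (block align)
    append_le(out, 24, 2);                         // bits per sample
    out += "data";  append_le(out, dataSize, 4);

    for (double s : samples) {
        if (s > 1.0) s = 1.0;                      // clamp before scaling
        if (s < -1.0) s = -1.0;
        const int32_t v = static_cast<int32_t>(std::lround(s * 8388607.0));
        append_le(out, static_cast<uint32_t>(v), 3); // low 3 bytes, two's complement
    }
    if (pad) out.push_back('\0');
    return out;
}
```

The returned string can be written to disk with an ofstream opened in binary mode and a single write(bytes.data(), bytes.size()). For 200 Hz at 44100 Hz this produces 220 samples, so a 44-byte header plus 660 bytes of data.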
I wrote two functions which should export an audio float buffer into a .wav file, but I have problems playing the exported file. Audacity plays it as it should (it sounds exactly like it does within my application); however, Ableton (DAW software) seems to misinterpret some part of the wav, so it sounds really distorted (like a distortion effect).
I guess that Ableton somehow assumes a wrong (smaller) sample depth, so the actual samples exceed the limits.
I have two functions: one creates an int32_t buffer from two float buffers (mixing left and right into one buffer), the other writes the .wav file, including the format chunk etc. I guess the problem is somewhere in there.
class members / structs
// static I use in the export function
static const int FORMAT_PCM = 1;
static const int CHANNEL_COUNT = 2; // fix stereo
static const int BYTES_PER_SAMPLE = 4; // fix bytes per sample, 32bit audio
// a function I found on the internet, helps writing the bytes to the file
template <typename T>
static void write(std::ofstream& stream, const T& t) {
    stream.write((const char*)&t, sizeof(T));
}
// "structure" used to store the buffer
class StereoAudioBuffer {
public:
    StereoAudioBuffer(int length) : sizeInSamples(2*length) {
        samples = new int32_t[2*length];
    }
    ~StereoAudioBuffer() { delete[] samples; } // array form, to match new[]
    int32_t *samples;
    const int sizeInSamples;
};
converting function
StereoAudioBuffer* WaveExport::convertTo32BitStereo(
    float *leftSamples,
    float *rightSamples,
    int length)
{
    StereoAudioBuffer *buffer = new StereoAudioBuffer(length);
    float max = 0;
    // find max sample (std::fabs from <cmath>; plain abs may truncate to int)
    for(int i = 0; i < length; i++) {
        if(std::fabs(leftSamples[i]) > max) {
            max = std::fabs(leftSamples[i]);
        }
        if(std::fabs(rightSamples[i]) > max) {
            max = std::fabs(rightSamples[i]);
        }
    }
    // avoid dividing by zero when the input is pure silence
    if(max == 0) {
        max = 1;
    }
    // normalise and scale to size(int32_t)
    float factor = 2147483000.0f / max;
    for(int i = 0; i < length; i++) {
        buffer->samples[2*i] = leftSamples[i] * factor;
        buffer->samples[2*i+1] = rightSamples[i] * factor;
    }
    return buffer;
}
the exporting function (part of this code comes from the internet; sadly, I can't find the source anymore)
void WaveExport::writeStereoWave(
    const char *path,
    StereoAudioBuffer* buffer,
    int sampleRate)
{
    std::ofstream stream(path, std::ios::binary);
    stream.write("RIFF", 4);
    write<int>(stream, 36 + buffer->sizeInSamples * BYTES_PER_SAMPLE); // 32 bits -> 4 bytes
    stream.write("WAVE", 4);
    stream.write("fmt ", 4);
    write<int>(stream, 16);
    write<short>(stream, FORMAT_PCM); // Format
    write<short>(stream, CHANNEL_COUNT); // Channels
    write<int>(stream, sampleRate); // Sample Rate
    write<int>(stream, sampleRate * CHANNEL_COUNT * BYTES_PER_SAMPLE); // Byterate
    write<short>(stream, CHANNEL_COUNT * BYTES_PER_SAMPLE); // Frame size
    write<short>(stream, 8 * BYTES_PER_SAMPLE); // Bits per sample
    int dataChunkSize = buffer->sizeInSamples * BYTES_PER_SAMPLE;
    stream.write("data", 4);
    stream.write((const char*)&dataChunkSize, 4);
    stream.write((const char*)buffer->samples, BYTES_PER_SAMPLE*buffer->sizeInSamples);
}
Does anybody know how to write .wav files and maybe can tell me what I did wrong or missed?
There was no problem. I used a 32-bit .wav, which just wasn't supported by the application I used for playback.
I changed the export functions to use int16_t (16-bit depth), and it works fine.
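The 16-bit conversion could look roughly like the sketch below. It is not the code from the application: the names float_to_int16 and to_int16_stereo are invented here, and the sketch assumes the input floats are nominally in [-1, 1] and clamps anything outside that range rather than normalising to the peak. The header fields have to change to match, i.e. 2 bytes per sample and 16 bits per sample.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert one float sample to 16-bit PCM, clamping out-of-range input.
static int16_t float_to_int16(float s)
{
    if (s > 1.0f) s = 1.0f;
    if (s < -1.0f) s = -1.0f;
    return static_cast<int16_t>(std::lround(s * 32767.0f));
}

// Hypothetical helper: interleave left/right float channels as L R L R ... int16_t.
std::vector<int16_t> to_int16_stereo(const float* left, const float* right, std::size_t length)
{
    assert(length == 0 || (left != nullptr && right != nullptr));
    std::vector<int16_t> out(2 * length);
    for (std::size_t i = 0; i < length; ++i) {
        out[2 * i]     = float_to_int16(left[i]);
        out[2 * i + 1] = float_to_int16(right[i]);
    }
    return out;
}
```

Scaling by 32767 rather than 32768 keeps +1.0 inside the int16_t range at the cost of never producing -32768, which is a common and harmless choice.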
I'm developing imaging functions (yes, I REALLY want to reinvent the wheel, for various reasons).
I'm copying bitmaps into unsigned char arrays, but I'm having some problems with byte size versus image pixel format.
For example, a lot of images come as 24 bits per pixel for RGB representation, so that's rather easy: every pixel has 3 unsigned chars (bytes) and everyone is happy.
However, sometimes the RGB has weirder types like 48 bits per pixel, so 16 bits per color channel. Copying the whole image into the byte array works fine, but it's when I want to retrieve the data that things get blurry.
Right now I have the following code to get a single pixel for grayscale images:
unsigned char NImage::get_pixel(int i, int j)
{
    return this->data[j * pitch + i];
}
NImage::data is an unsigned char array.
This returns a single byte. How can I access my data array with different pixel formats?
You should do it like this:
unsigned short NImage::get_pixel(int i, int j)
{
    int offset = 2 * (j * pitch + i);
    // assumes the high byte comes first (big-endian, as in PNG); swap the two bytes for little-endian data
    return data[offset]*256 + data[offset+1];
}
At 48 bits per pixel, with 16 bit per color, you can't return an 8 bit value, you must return a 16 bit short or unsigned short otherwise the data gets truncated.
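A slightly more general sketch that covers both 16-bit grayscale and 48-bit RGB is shown below. It is written as a free function so it stands alone; get_channel16 and all of its parameters are hypothetical, and it assumes that the pitch is the row stride in bytes and that the caller knows the byte order of the file format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read one 16-bit channel of pixel (x, y).
// pitchBytes: distance between two rows in bytes (assumption).
// channels:   1 for 16-bit grayscale, 3 for 48-bit RGB.
// bigEndian:  true if the high byte is stored first.
uint16_t get_channel16(const unsigned char* data, std::size_t pitchBytes,
                       std::size_t x, std::size_t y,
                       std::size_t channels, std::size_t channel, bool bigEndian)
{
    assert(data != nullptr && channel < channels);
    const std::size_t offset = y * pitchBytes + (x * channels + channel) * 2;
    const unsigned b0 = data[offset];
    const unsigned b1 = data[offset + 1];
    return static_cast<uint16_t>(bigEndian ? (b0 << 8) | b1 : (b1 << 8) | b0);
}
```

Assembling the value from individual bytes avoids both alignment problems and any dependence on the byte order of the machine running the code.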
You might try developing overloaded functions to handle this.
You have to know how big your pixels are.
If it's RGB then your 100x100 pixel image (say) will have 30,000 unsigned chars.
unsigned char NImage::get_red_component(int i, int j)
{
    return this->data[3*(j * pitch + i)];
}
unsigned char NImage::get_green_component(int i, int j)
{
    return this->data[3*(j * pitch + i) + 1];
}
unsigned char NImage::get_blue_component(int i, int j)
{
    return this->data[3*(j * pitch + i) + 2];
}
Or for 48-bit RGB,
unsigned char NImage::get_red_MSB(int i, int j)
{
    return this->data[6*(j * pitch + i)];
}
unsigned char NImage::get_red_LSB(int i, int j)
{
    return this->data[6*(j * pitch + i) + 1];
}
... etc etc ...
What's the problem with 48bits per pixel? Simply read your data as uint16_t or unsigned short and you get the 16 bit extracted properly.
It gets worse for more complicated bit pattern, i.e. rgb565 where you'll need to extract data using bitmasks.
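As an illustration of the bitmask case, here is a small sketch for RGB565, assuming the usual layout with red in the top 5 bits, green in the middle 6 and blue in the low 5, and (for the reader function) pixels stored little-endian in the byte array. The names Rgb8, unpack_rgb565 and read_rgb565_le are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Rgb8 { uint8_t r, g, b; };

// Split one RGB565 value into 8-bit channels.
Rgb8 unpack_rgb565(uint16_t p)
{
    const unsigned r5 = (p >> 11) & 0x1F;
    const unsigned g6 = (p >> 5) & 0x3F;
    const unsigned b5 = p & 0x1F;
    // Replicate the top bits into the low bits so full scale maps to 255.
    return { static_cast<uint8_t>((r5 << 3) | (r5 >> 2)),
             static_cast<uint8_t>((g6 << 2) | (g6 >> 4)),
             static_cast<uint8_t>((b5 << 3) | (b5 >> 2)) };
}

// Read pixel number pixelIndex from a byte buffer (little-endian storage assumed).
Rgb8 read_rgb565_le(const unsigned char* data, std::size_t pixelIndex)
{
    assert(data != nullptr);
    const unsigned lo = data[2 * pixelIndex];
    const unsigned hi = data[2 * pixelIndex + 1];
    return unpack_rgb565(static_cast<uint16_t>(lo | (hi << 8)));
}
```

Other packed formats follow the same pattern with different shifts and masks.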