Cortex M4 DSP: Maximum of a table of int16_t

I would like to accelerate the following code which computes the maximum value of a table:
int16_t max = 0;
int16_t *in16 = (int16_t *)myptr;
for (int i = 0; i < 256; i++) {
    int16_t val = *in16++;
    if (val > max)
        max = val;
}
return max;
Is there a Cortex M4 DSP instruction that computes the maximum of adjacent values? I can't find any.
In CMSIS, I see "arm_max_q15". How is this function implemented?

There doesn't seem to be much acceleration possible. The code of arm_max_q15 is here: https://github.com/ARM-software/CMSIS/blob/master/CMSIS/DSP_Lib/Source/StatisticsFunctions/arm_max_q15.c
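For what it's worth, the M4 does have dual-halfword compare/select primitives (SSUB16 sets the APSR.GE flags per 16-bit lane, and SEL then picks lanes based on them), so a packed running maximum can be built by hand. Below is a minimal sketch using the CMSIS-Core intrinsics; the pointer setup and the INT16_MIN seed are my assumptions, not tested code:
#include <stdint.h>
// Assumes a CMSIS-Core header (e.g. core_cm4.h) is included for __SSUB16/__SEL,
// and that in32 points at 256 int16_t values (128 words), 32-bit aligned.
int16_t max_packed(const uint32_t *in32)
{
    uint32_t vmax = 0x80008000u;      // both halfwords seeded with INT16_MIN
    for (int i = 0; i < 128; i++) {
        uint32_t v = in32[i];         // two int16_t values per word
        (void)__SSUB16(vmax, v);      // sets GE flags where vmax lane >= v lane
        vmax = __SEL(vmax, v);        // per-lane max(vmax, v)
    }
    int16_t hi = (int16_t)(vmax >> 16);
    int16_t lo = (int16_t)(vmax & 0xFFFF);
    return hi > lo ? hi : lo;
}
Whether this actually beats the compiler's scalar loop needs measuring; the loop is still dominated by the loads.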

Related

Given N integers, remove 1 of them so that their XOR is as large as possible

You are given N 64-bit integers (long longs). You need to remove one of them so that the XOR of the remaining N-1 integers is as large as possible. Print the XOR to the console. The number of integers will not exceed 1e6.
By XOR of all the integers I mean something like this:
long long myXor = 0;
for (int i = 0; i < arr.size(); i++) {
    myXor = myXor ^ arr[i];
}
Also, this is not my homework; I know posting homework here is frowned upon. I've been trying to solve this myself, but I've only come up with solutions that run in O(n^2).
(Intelligent) brute-forcing might actually be the best thing.
And you won't even need O(n²) for that.
Since XOR is reversible, you can calculate the XOR of all numbers first, then XOR again with one number to get the XOR of all numbers except that one.
Using this, it is quite easy to reduce the brute-force solution to O(n):
long long xorAll = 0;
for (int i = 0; i < arr.size(); i++) {
    xorAll ^= arr[i];
}
long long max = LLONG_MIN; // stores the maximum XOR seen so far
for (int i = 0; i < arr.size(); i++) {
    if ((xorAll ^ arr[i]) > max)
        max = xorAll ^ arr[i];
}
#include <iostream>
#include <climits>

typedef long long ll;
const int MAX_N = 1000000;

int main() {
    // initialize the total xor
    ll tot = 0;
    // static storage keeps this 8 MB array off the stack
    static ll nums[MAX_N];
    // read in the number of elements
    int n;
    std::cin >> n;
    for (int i = 0; i < n; i++) {
        std::cin >> nums[i];
        // use the built-in xor operator (^)
        tot ^= nums[i];
    }
    // initialize the maximum value to the smallest possible long long
    ll maxVal = LLONG_MIN;
    for (int i = 0; i < n; i++) {
        // xor undoes itself, so this is essentially removing nums[i]
        ll val = tot ^ nums[i];
        // if it's larger than the max, update the max
        if (val > maxVal) {
            maxVal = val;
        }
    }
    // output it
    std::cout << maxVal << std::endl;
}
There are other answers showing how this can be done in O(n) time quite easily, e.g.
// xor all the numbers
long long myXor = 0;
for (auto n : arr) {
    myXor ^= n;
}
// find the max
long long max = std::numeric_limits<long long>::min();
for (auto n : arr) {
    max = std::max(max, myXor ^ n);
}
However, you can use the property that the XOR operations can be done out of order. This lets you use the algorithms in <numeric>. Note that std::reduce requires a commutative, associative operation, which bit_xor is; the max step maps each element first, so it belongs in std::transform_reduce rather than a plain reduce:
// xor all the numbers (needs <numeric> and <functional>)
auto myXor = std::reduce(arr.cbegin(), arr.cend(), 0ll, std::bit_xor{});
// find the max: map each n to myXor ^ n, then reduce with max
// (needs <limits> and <algorithm>)
auto max = std::transform_reduce(arr.cbegin(), arr.cend(),
                                 std::numeric_limits<long long>::min(),
                                 [](auto a, auto b) { return std::max(a, b); },
                                 [myXor](auto n) { return myXor ^ n; });
Here is a quick and dirty comparison between the two solutions, which shows a considerable speedup (about 40% for 1e6 numbers).
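(The linked comparison isn't reproduced here; as a stand-in, a minimal timing harness for the O(n) approach could look like the following. Sizes, seed, and clock choice are arbitrary.)
#include <algorithm>
#include <chrono>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

int main()
{
    std::mt19937_64 rng(42);
    std::vector<long long> arr(1000000);
    for (auto &v : arr) v = static_cast<long long>(rng());

    auto t0 = std::chrono::steady_clock::now();
    long long myXor = 0;
    for (auto n : arr) myXor ^= n;
    long long best = std::numeric_limits<long long>::min();
    for (auto n : arr) best = std::max(best, myXor ^ n);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << best << " in "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}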

Fast search/replace of matching single bytes in an 8-bit array, on ARM

I develop image processing algorithms (using GCC, targeting ARMv7 (Raspberry Pi 2B)).
In particular I use a simple algorithm which replaces one index value with another in a mask:
void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    for (size_t i = 0; i < size; ++i)
    {
        if (mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}
Unfortunately it has poor performance for the target platform.
Is there any way to optimize it?
The ARMv7 platform supports SIMD instructions called NEON.
Using them you can make your code faster:
#include <arm_neon.h>

void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    size_t alignedSize = size / 16 * 16, i = 0;
    uint8x16_t _oldIndex = vdupq_n_u8(oldIndex);
    uint8x16_t _newIndex = vdupq_n_u8(newIndex);
    for (; i < alignedSize; i += 16)
    {
        uint8x16_t oldMask = vld1q_u8(mask + i);             // load a 128-bit vector
        uint8x16_t condition = vceqq_u8(oldMask, _oldIndex); // compare two 128-bit vectors
        uint8x16_t newMask = vbslq_u8(condition, _newIndex, oldMask); // selective copy of a 128-bit vector
        vst1q_u8(mask + i, newMask);                         // store the 128-bit vector
    }
    for (; i < size; ++i)
    {
        if (mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}
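A quick way to sanity-check the vector path against the scalar tail (the values are arbitrary; assumes ChangeIndex above is in the same file):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint8_t mask[100];                 // 100 is not a multiple of 16,
    memset(mask, 3, sizeof(mask));     // so both code paths get exercised
    mask[5] = 7;
    mask[97] = 7;                      // index 97 lands in the scalar tail
    ChangeIndex(mask, sizeof(mask), 7, 9);
    printf("%d %d %d\n", mask[5], mask[50], mask[97]); /* expect: 9 3 9 */
    return 0;
}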

How can I subtract two IPv6 addresses (128bit numbers) in C/C++?

I'm storing the IP address in a sockaddr_in6, which holds the address as an array of four 32-bit words, addr[4]: essentially a 128-bit number.
I'm trying to calculate the number of IPs in a given IPv6 range (how many IPs lie between two addresses), so it's a matter of subtracting one address from the other using the two four-element arrays.
The problem is that since there's no 128-bit data type, I can't simply convert and subtract.
Thanks a ton!
You could use some kind of big-int library (if you can tolerate the LGPL, GMP is the usual choice). Fortunately, 128-bit subtraction is easy to simulate by hand if necessary. Here is a quick and dirty demonstration of computing the absolute value of (a-b) for 128-bit values:
#include <iostream>
#include <iomanip>

struct U128
{
    unsigned long long hi;
    unsigned long long lo;
};

bool subtract(U128& a, U128 b)
{
    unsigned long long carry = b.lo > a.lo;
    a.lo -= b.lo;
    unsigned long long carry2 = b.hi > a.hi || (a.hi == b.hi && carry);
    a.hi -= carry;
    a.hi -= b.hi;
    return carry2 != 0;
}

int main()
{
    U128 ipAddressA = { 45345, 345345 };
    U128 ipAddressB = { 45345, 345346 };
    bool carry = subtract(ipAddressA, ipAddressB);
    // Carry being set means that we underflowed: ipAddressB was > ipAddressA.
    // Let's just compute 0 - ipAddressA as a means to calculate the negation
    // (0-x) of our current value. This gives us the absolute value of the
    // difference.
    if (carry)
    {
        ipAddressB = ipAddressA;
        ipAddressA = { 0, 0 };
        subtract(ipAddressA, ipAddressB);
    }
    // Print a gigantic hex string of the 128-bit value
    std::cout.fill('0');
    std::cout << std::hex << std::setw(16) << ipAddressA.hi
              << std::setw(16) << ipAddressA.lo << std::endl;
}
This gives you the absolute value of the difference. If the range is not huge (64 bits or less), then ipAddressA.lo can be your answer as a simple unsigned long long.
If you have performance concerns, you can use compiler intrinsics to take advantage of particular architectures, such as amd64, if you want it to be optimal on that processor. _subborrow_u64 is the amd64 intrinsic for the necessary borrow-propagating subtraction.
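A sketch of subtract() rewritten with that intrinsic; the header name and availability vary by compiler (MSVC x64 uses <intrin.h>, GCC/Clang expose it via <x86intrin.h>), so treat this as untested:
#include <intrin.h>  // MSVC x64; on GCC/Clang use <x86intrin.h>

bool subtract_intrin(U128& a, const U128& b)
{
    // borrow propagates from the low word into the high word
    unsigned char borrow = _subborrow_u64(0, a.lo, b.lo, &a.lo);
    borrow = _subborrow_u64(borrow, a.hi, b.hi, &a.hi);
    return borrow != 0; // set means b > a (underflow), as before
}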
The in6_addr structure stores the address in network byte order, or 'big endian', with the most significant byte in s6_addr[0]. You can't count on the other union members being consistently named, or defined. Even if you accessed the union through a (non-portable) uint32_t field, the values would have to be converted with ntohl. So a portable method of finding the difference needs some work.
You can convert the in6_addr to uint64_t[2]. Sticking with typical 'bignum' conventions, we use [0] for the low 64-bits and [1] for the high 64-bits:
static inline void
in6_to_u64 (uint64_t dst[2], const struct in6_addr *src)
{
    uint64_t hi = 0, lo = 0;

    for (unsigned int i = 0; i < 8; i++)
    {
        hi = (hi << 8) | src->s6_addr[i];
        lo = (lo << 8) | src->s6_addr[i + 8];
    }
    dst[0] = lo, dst[1] = hi;
}
and the difference:
static inline unsigned int
u64_diff (uint64_t d[2], const uint64_t x[2], const uint64_t y[2])
{
    unsigned int b = 0, bi;

    for (unsigned int i = 0; i < 2; i++)
    {
        uint64_t di, xi, yi, tmp;

        xi = x[i], yi = y[i];
        tmp = xi - yi;
        di = tmp - b, bi = tmp > xi;
        d[i] = di, b = bi | (di > tmp);
    }
    return b; /* borrow flag = (x < y) */
}
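Hypothetical usage, counting the addresses in an inclusive range (the +1 makes the count inclusive; overflow of the count itself is ignored here):
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main (void)
{
    struct in6_addr lo, hi;
    uint64_t a[2], b[2], d[2];

    inet_pton (AF_INET6, "2001:db8::", &lo);
    inet_pton (AF_INET6, "2001:db8::1:0", &hi);

    in6_to_u64 (a, &hi);
    in6_to_u64 (b, &lo);
    u64_diff (d, a, b); /* d = hi - lo */

    if (d[1] == 0) /* difference fits in 64 bits */
        printf ("%llu addresses\n", (unsigned long long) d[0] + 1);
    return 0;
}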

Cubic root function cbrt() in Visual Studio 2012

I am writing a C/C++ program in Visual Studio 2012 Professional (Windows) which calculates many powers using pow(). I ran the profiler to find out why it takes such a long time to run, and found that pow() is the bottleneck.
I have rewritten the powers such as
pow(x,1.5) to x*sqrt(x)
and
pow(x,1.75) to sqrt(x*x*x*sqrt(x))
which significantly improved the speed of the program.
A few powers are of the kind pow(x,1.0/3.0), so I looked for the cube root function cbrt() to speed things up, but it seems to be unavailable in Visual Studio, which I can hardly imagine. Therefore my question:
Where can I find the cbrt() function in Visual Studio 2012 Professional, and if it isn't there, what are the alternatives besides pow(x,1.0/3.0)?
Kind regards,
Ernst Jan
This site explores several computational methods to compute the cube root efficiently in C, and has some source code you can download.
(EDIT: A google search for "fast cube root" comes up with several more promising-looking hits.)
Cube roots are a topic of interest, because they're used in many common formulae and a fast cube root function isn't included with Microsoft Visual Studio.
In the absence of a special cube root function, a typical strategy is calculation via a power function (e.g., pow(x, 1.0/3.0)). This can be problematic in terms of speed and in terms of accuracy when negative numbers aren't handled properly.
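If the only worry is correctness for negative inputs while staying on the pow() route, a thin wrapper is enough; a trivial sketch, not a performance fix:
#include <math.h>

/* Cube root via pow(), handling the sign explicitly:
   pow(x, 1.0/3.0) returns NaN for negative x. */
double cbrt_fallback(double x)
{
    return x < 0.0 ? -pow(-x, 1.0 / 3.0) : pow(x, 1.0 / 3.0);
}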
His site has some benchmarks on the methods used. All of them are much faster than pow().
32-bit float tests
----------------------------------------
cbrt_5f 8.8 ms 5 mbp 6.223 abp
pow 144.5 ms 23 mbp 23.000 abp
halley x 1 31.8 ms 15 mbp 18.961 abp
halley x 2 59.0 ms 23 mbp 23.000 abp
newton x 1 23.4 ms 10 mbp 12.525 abp
newton x 2 48.9 ms 20 mbp 22.764 abp
newton x 3 72.0 ms 23 mbp 23.000 abp
newton x 4 89.6 ms 23 mbp 23.000 abp
See the site for downloadable source.
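To give the flavor of the cbrt_5f/newton rows above, here is the usual exponent-division bit hack followed by one Newton-Raphson step. The magic constant is the commonly quoted one, not necessarily the site's exact value, and the sketch is valid for positive inputs only:
#include <stdint.h>
#include <string.h>

float cbrt_newton1(float x) /* x > 0 */
{
    /* initial guess: divide the biased exponent by 3 in integer space */
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = i / 3 + 709921077u; /* assumed constant, roughly (2/3)*127*2^23 */
    float y;
    memcpy(&y, &i, sizeof y);
    /* one Newton-Raphson step on f(y) = y*y*y - x */
    y = (2.0f * y + x / (y * y)) / 3.0f;
    return y;
}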
The implementation below is 4x faster than std::pow, with a relatively loose tolerance (0.000001), on an AVX-512 CPU. It is made of vertically auto-vectorizable loops for every basic operation such as multiplication and division, so that it computes 8/16/32 elements at once instead of horizontally vectorizing the Newton-Raphson loop.
#include <cmath>

/*
    Newton-Raphson iterative solution
    f_err(x) = x*x*x - N
    f'_err(x) = 3*x*x
    x = x - (x*x*x - N)/(3*x*x)
    x = x - (x - N/(x*x))/3   <--- repeat until error < tolerance
    but with vertical parallelization
*/
template <typename Type, int Simd, int inverseTolerance>
inline
void cubeRootFast(Type *const __restrict__ data,
                  Type *const __restrict__ result) noexcept
{
    // alignment 64 required for AVX512 vectorization
    alignas(64) Type xd[Simd];
    alignas(64) Type resultData[Simd];
    alignas(64) Type xSqr[Simd];
    alignas(64) Type nDivXsqr[Simd];
    alignas(64) Type diff[Simd];

    // near-zero checking mask
    for (int i = 0; i < Simd; i++)
    {
        xd[i] = data[i] <= Type(0.000001);
    }
    // skips division by zero if input is zero or close to zero
    for (int i = 0; i < Simd; i++)
    {
        resultData[i] = xd[i] ? Type(1.0) : data[i];
    }
    // Newton-Raphson iterations in parallel
    bool work = true;
    while (work)
    {
        // compute x*x
        for (int i = 0; i < Simd; i++)
        {
            xSqr[i] = resultData[i] * resultData[i];
        }
        // compute N/(x*x)
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = data[i] / xSqr[i];
        }
        // compute x - N/(x*x)
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = resultData[i] - nDivXsqr[i];
        }
        // compute (x - N/(x*x))/3
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = nDivXsqr[i] / Type(3.0);
        }
        // compute x - (x - N/(x*x))/3
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = resultData[i] - nDivXsqr[i];
        }
        // compute error
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = resultData[i] - diff[i];
        }
        // compute absolute error
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = std::abs(diff[i]);
        }
        // compute condition to stop looping (error < tolerance)
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = diff[i] > Type(1.0 / inverseTolerance);
        }
        // all SIMD lanes have to have zero work left to end
        Type check = 0;
        for (int i = 0; i < Simd; i++)
        {
            check += diff[i];
        }
        work = (check > Type(0.0));
        // compute the next x guess
        for (int i = 0; i < Simd; i++)
        {
            resultData[i] = resultData[i] - nDivXsqr[i];
        }
    }
    // if input was close to zero, output zero;
    // output the result otherwise
    for (int i = 0; i < Simd; i++)
    {
        result[i] = xd[i] ? Type(0.0) : resultData[i];
    }
}

#include <iostream>

int main()
{
    constexpr int n = 8192;
    constexpr int simd = 16;
    constexpr int inverseTolerance = 1000;
    float data[n];
    for (int i = 0; i < n; i++)
    {
        data[i] = i;
    }
    for (int i = 0; i < n; i += simd)
    {
        cubeRootFast<float, simd, inverseTolerance>(data + i, data + i);
    }
    for (int i = 0; i < 10; i++)
        std::cout << data[i * i * i] << std::endl;
    return 0;
}
It is tested only with GCC, so it may require extra MSVC pragmas on each loop to force auto-vectorization. If you have OpenMP, you can also use #pragma omp simd safelen(Simd) to achieve the same thing.
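For example, the first per-lane loop with that pragma applied; each of the inner loops would get the same annotation:
// compute x*x, explicitly requesting SIMD vectorization via OpenMP
#pragma omp simd safelen(Simd)
for (int i = 0; i < Simd; i++)
{
    xSqr[i] = resultData[i] * resultData[i];
}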
The performance only holds within the [0,1] range. To use bigger values, you should apply a range reduction like this:
// example: max value is 1000
for (auto & input : inputs)
    input = input / 1000.0f;  // normalize
for (...)
    cubeRootFast<float, simd, inverseTolerance>(input + i, input + i);
for (auto & input : inputs)
    input = 10.0f * input;    // de-normalize (1000 = 10 x 10 x 10)
If you only need about 0.005 error on a low range like [0,1000] with a 16x speedup, you can try the implementation below, which uses a polynomial approximation (the Horner scheme is applied so it compiles to FMA instructions, and no explicit auto-vectorization is required since it contains no branching or loops):
// optimized for [0,1] range: ~1 cycle on AVX512, 0.003 average error
// polynomial approximation with Horner scheme for FMA optimization
template<typename T>
T cubeRootFast(T x)
{
    T xd = x - T(1.0);
    T result = T(-55913.0 / 4782969.0);
    result *= xd;
    result += T(21505.0 / 1594323.0);
    result *= xd;
    result += T(-935.0 / 59049.0);
    result *= xd;
    result += T(374.0 / 19683.0);
    result *= xd;
    result += T(-154.0 / 6561.0);
    result *= xd;
    result += T(22.0 / 729.0);
    result *= xd;
    result += T(-10.0 / 243.0);
    result *= xd;
    result += T(5.0 / 81.0);
    result *= xd;
    result += T(-1.0 / 9.0);
    result *= xd;
    result += T(1.0 / 3.0);
    result *= xd;
    result += T(1.0);
    return result;
}
// range reduction + de-reduction: ~1 cycle on AVX512
for (int i = 0; i < 8192; i++)
{
    float inp = input[i];
    // scaling + descaling for range [1,999]
    float scaling = (inp > 333.0f) ? (1000.0f) : (333.0f);
    scaling = (inp > 103.0f) ? scaling : (103.0f);
    scaling = (inp > 29.0f) ? scaling : (29.0f);
    scaling = (inp > 7.0f) ? scaling : (7.0f);
    scaling = (inp > 3.0f) ? scaling : (3.0f);
    output[i] = powf(scaling, 0.33333333333f) * cubeRootFast<float>(inp / scaling);
}

C++ use SSE instructions for comparing huge vectors of ints

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take two rows of this vector and compare them with this function:
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;
    for (int k = 0; k < 128; k += 2) {
        pplus = nodeL[indx2][k] - nodeL[indx1][k];
        pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns the maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is a micro-optimization, but any help would be appreciated, since I know nothing about SSE. Thanks in advance
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
    nodeIds[l] = randomNodeID(18000000);
    goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
    return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}

/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>

int getDiff_SSE(int indx1, int indx2)
{
    int result[4] __attribute__ ((aligned(16))) = { 0 };
    const int * const p1 = &nodeL[indx1][0];
    const int * const p2 = &nodeL[indx2][0];
    const __m128i vke = _mm_set_epi32(0, -1, 0, -1); // -1 in the even lanes
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0); // -1 in the odd lanes
    __m128i vresult = _mm_set1_epi32(0);

    for (int k = 0; k < 128; k += 4)
    {
        __m128i v1, v2, vmax;

        v1 = _mm_loadu_si128((__m128i *)&p1[k]);
        v2 = _mm_loadu_si128((__m128i *)&p2[k]);
        v1 = _mm_xor_si128(v1, vke);  // negate the even lanes of v1 ...
        v2 = _mm_xor_si128(v2, vko);  // ... and the odd lanes of v2,
        v1 = _mm_sub_epi32(v1, vke);  // using (x ^ -1) - (-1) == -x
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2); // lanes now hold pplus/pminus interleaved
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *)result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU work faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance, so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads, i.e. when the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of a vector of vectors is expensive; caching one of the dereferences will help. I know it's not answering the posted question, but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;
    const vector<int>& nodetemp1 = nodeL[indx1];
    const vector<int>& nodetemp2 = nodeL[indx2];
    for (int k = 0; k < 128; k += 2) {
        pplus = nodetemp2[k] - nodetemp1[k];
        pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
A couple of things to look at. One is the amount of data you are passing around: that will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using AVX SIMD instructions, via the vector wrapper library here
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL, int row1, int row2) {
    Vec4i result(0);
    const std::vector<int>& nodetemp1 = nodeL[row1];
    const std::vector<int>& nodetemp2 = nodeL[row2];
    Vec8i mask(-1, 0, -1, 0, -1, 0, -1, 0);
    for (int k = 0; k < 128; k += 8) {
        Vec8i nodeA(nodetemp1[k], nodetemp1[k+1], nodetemp1[k+2], nodetemp1[k+3],
                    nodetemp1[k+4], nodetemp1[k+5], nodetemp1[k+6], nodetemp1[k+7]);
        Vec8i nodeB(nodetemp2[k], nodetemp2[k+1], nodetemp2[k+2], nodetemp2[k+3],
                    nodetemp2[k+4], nodetemp2[k+5], nodetemp2[k+6], nodetemp2[k+7]);
        Vec8i tmp = select(mask, nodeB - nodeA, nodeA - nodeB); // pplus in even lanes, pminus in odd lanes
        Vec4i tmp_a(tmp[0], tmp[2], tmp[4], tmp[6]);
        Vec4i tmp_b(tmp[1], tmp[3], tmp[5], tmp[7]);
        Vec4i max_tmp = max(tmp_a, tmp_b);
        result = select(max_tmp > result, max_tmp, result);
    }
    // result holds four per-lane maxima; the answer is their maximum,
    // not their sum (horizontal_add would add them up)
    return std::max(std::max(result[0], result[1]), std::max(result[2], result[3]));
}
The lack of branching speeds it up to 9.5s, but moving the data is still the biggest cost.
If you want to speed it up more, try changing the data structure to a single flat array/vector rather than a 2D one (i.e. one std::vector instead of a vector of vectors), as that will reduce cache pressure.
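A sketch of that layout change, with hypothetical names and the sizes hardcoded for illustration: one contiguous allocation, with rows located by index arithmetic:
#include <cstddef>
#include <vector>

constexpr std::size_t kRows = 18000000; // 18M rows of 128 ints
constexpr std::size_t kCols = 128;

// one contiguous block instead of 18M separately allocated inner vectors
std::vector<int> nodeFlat(kRows * kCols);

inline const int* rowPtr(int idx)
{
    return &nodeFlat[static_cast<std::size_t>(idx) * kCols];
}
// getDiff would then read rowPtr(indx1)[k] and rowPtr(indx2)[k].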
EDIT
I thought of something: you could use a custom allocator to ensure the 2*18M vectors are allocated in a contiguous block of memory, which lets you keep the data structure and still traverse it quickly. But you'd need to profile it to be sure.
EDIT 2
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single-array approach. Give this a go though.