Image maximum value with SSE instructions - C++

I'm trying to code a function that returns the maximum value of an image using SSE instructions. I get a strange result: the maximum comes out as -356426400 (it should be 254).
This is my code:
void max_sse(unsigned int *src, long h, long w, unsigned int *val)
{
    unsigned int tab[16];
    for(int i=0; i<h*w; i+=16)
    {
        __m128i PG=_mm_load_si128((__m128i*)(&src[i]));
        __m128i max=_mm_max_epi8(max,PG);
        _mm_store_si128((__m128i*)&tab, max);
    }
    *val=tab[0];
    for (int i=0;i<16;i++)
    {
        if (tab[i]>*val)
        {
            *val=tab[i];
        }
    }
}

1) I don't see any code dealing with alignment
2) There's a mismatch between unsigned integers and _mm_max_epi8, which compares 8-bit signed integers (http://msdn.microsoft.com/en-us/library/bb514045(v=vs.90).aspx)
3) I'm assuming you have an h*w matrix with rows that are a multiple of 4 (or you deal with that some other way, e.g. with padding)
On Windows you could do something like:
#include "windows.h"
#include <malloc.h>
#include <smmintrin.h>
#include <iostream>
using namespace std;
void max_sse(unsigned int *src, long h, long w, unsigned int *val)
{
_STATIC_ASSERT(sizeof(unsigned int) == sizeof(BYTE)*4);
if( w % 4 != 0)
return; // ERROR Can't do it, need 4-multiple rows or do some alignment!
unsigned int *aligned_src = (unsigned int*)_aligned_malloc(h*w*sizeof(unsigned int), 16); // _mm_load_si128 needs 16-bytes aligned memory
memcpy(aligned_src, src, sizeof(unsigned int)*h*w);
__declspec(align(16)) __m128i max = {0,0,0,0};
// Iterates the matrix
for(int i=0; i<h*w; i+=4)
{
__m128i *pg = (__m128i*)(aligned_src+i);
__m128i PG = _mm_load_si128(pg);
__m128i newmax = _mm_max_epu32(max, PG);
_mm_store_si128(&max, newmax);
}
unsigned int abs_max = 0;
unsigned int *max_val = (unsigned int*)&max;
for (int i=0;i<4;i++)
{
if (abs_max < *(max_val+i))
{
abs_max = *(max_val+i);
}
}
_aligned_free(aligned_src);
cout << "The max is: " << abs_max << endl;
}
int main()
{
unsigned int src[] = {0,1,2,4, 5,6,7,8, 224,225,226,129};
unsigned int val;
max_sse(src, 3,4, &val);
return 0;
}
I'm assuming the memcpy is a necessary evil in your code, since there isn't any other information on memory alignment. If you can control the allocation yourself, allocate aligned memory up front and skip the copy; it will be a lot better.
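For reference, here is a minimal sketch (my own rewrite, not the asker's code) of how the original loop could be fixed directly. Besides the unsigned/epi8 mismatch, note that the original reads max before ever writing it and leaves most of tab uninitialized, which is where the garbage value comes from. This version assumes src is 16-byte aligned and h*w is a multiple of 4, and needs SSE4.1 for _mm_max_epu32:
#include <smmintrin.h> // SSE4.1 for _mm_max_epu32

void max_sse_fixed(const unsigned int *src, long h, long w, unsigned int *val)
{
    __m128i vmax = _mm_setzero_si128();   // initialize the running max BEFORE the loop
    for (long i = 0; i < h*w; i += 4)     // four unsigned ints per 128-bit load
    {
        __m128i pg = _mm_load_si128((const __m128i*)(src + i));
        vmax = _mm_max_epu32(vmax, pg);   // unsigned 32-bit max, not _mm_max_epi8
    }
    unsigned int tab[4];
    _mm_storeu_si128((__m128i*)tab, vmax);
    *val = tab[0];
    for (int i = 1; i < 4; i++)           // reduce the four lanes to a scalar once, at the end
        if (tab[i] > *val)
            *val = tab[i];
}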

Related

Convert int bits to float verbatim and print them

I'm trying to just copy the contents of a 32-bit unsigned int to be used as a float. Not casting it, just reinterpreting the integer bits as a float. I'm aware memcpy is the most-suggested option for this. However, when I memcpy from a uint32_t to a float and print out the individual bits, I see they are quite different.
Here is my code snippet:
#include <iostream>
#include <stdint.h>
#include <cstring>

using namespace std;

void print_bits(unsigned n) {
    unsigned i;
    for(i=1u<<31; i > 0; i/=2)
        (n & i) ? printf("1") : printf("0");
}

union {
    uint32_t u_int;
    float u_float;
} my_union;

int main()
{
    uint32_t my_int = 0xc6f05705;
    float my_float;

    //Method 1 using memcpy
    memcpy(&my_float, &my_int, sizeof(my_float));

    //Print using function
    print_bits(my_int);
    printf("\n");
    print_bits(my_float);

    //Print using printf
    printf("\n%0x\n", my_int);
    printf("%0x\n", my_float);

    //Method 2 using unions
    my_union.u_int = 0xc6f05705;
    printf("union int = %0x\n", my_union.u_int);
    printf("union float = %0x\n", my_union.u_float);
    return 0;
}
Outputs:
11000110111100000101011100000101
11111111111111111000011111010101
c6f05705
400865
union int = c6f05705
union float = 40087b
Can someone explain what's happening? I expected the bits to match. Didn't work with a union either.
You need to change the function print_bits so that it inspects the object's bytes instead of taking the value as an unsigned (which converts a float argument's value rather than reinterpreting its bits):
#include <limits.h> // for CHAR_BIT
#include <stdint.h>
#include <stdio.h>

inline int is_big_endian(void)
{
    const union
    {
        uint32_t i;
        char c[sizeof(uint32_t)];
    } e = { 0x01000000 };
    return e.c[0];
}

void print_bits(const void *src, unsigned int size)
{
    // Check the order of bytes in memory on this platform:
    int t, c;
    if (is_big_endian())
    {
        t = 0;
        c = 1;
    }
    else
    {
        t = size - 1;
        c = -1;
    }
    for (; t >= 0 && t <= (int)size - 1; t += c)
    {   // print the bits of each byte from the MSB to the LSB
        unsigned char i;
        unsigned char n = ((unsigned char*)src)[t];
        for (i = 1 << (CHAR_BIT - 1); i > 0; i /= 2)
        {
            printf("%d", (n & i) != 0);
        }
    }
    printf("\n");
}
and call it like this:
int a = 7;
print_bits(&a, sizeof(a));
That way there won't be any type conversion when you call print_bits, and it will work for an object of any size.
EDIT: I replaced 7 with CHAR_BIT - 1 because the size of a byte can differ from 8 bits.
EDIT 2: I added support for both little endian and big endian compilers.
Also, as @M.M suggested in the comments, you can use a template to make the call print_bits(a) instead of print_bits(&a, sizeof(a)).
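For example, a minimal sketch of such a wrapper (my own illustration, not from the answer):
template <typename T>
void print_bits(const T &value)
{
    print_bits(&value, sizeof(value)); // forwards to the two-argument version above
}
With that overload in scope, print_bits(a) works for an int, a double, or a whole struct alike.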

Why would this code give a segfault only for some cases?

I am trying to code a word-RAM version of subset sum. (It is a basic DP algorithm, and the algorithm itself should not be important for determining the problem with the code.) This is the minimum code needed to reproduce the error, I think:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// get bit #bitno from num. 0 is most significant.
unsigned int getbit(unsigned int num, int bitno){
    unsigned int w = sizeof(int)*8; //for regular word.
    int shiftno = w-bitno-1;
    unsigned int mask = 1<<shiftno;
    unsigned int maskedn = num&mask;
    unsigned int thebit = maskedn>>shiftno;
    return thebit;
}

/* No boundary array right shift */
unsigned int* nbars(unsigned int orig[], unsigned int x){
    int alength = sizeof(orig)/sizeof(orig[0]);
    unsigned int b_s = sizeof(int)*8;
    unsigned int * shifted;
    shifted = new unsigned int[alength];
    int i;
    for(i=0;i<alength;i++){
        shifted[i] = 0;
    }
    unsigned int aux1 = 0;
    unsigned int aux2 = 0;
    int bts = floor(x/b_s);
    int split = x%b_s;
    i = bts;
    int j = 0;
    while(i<alength){
        aux1 = orig[j]>>split;
        shifted[i] = aux1|aux2;
        aux2 = orig[j]<<(b_s-split);
        i++;j++;
    }
    return shifted;
}

/* Returns true if there is a subset of set[] with sum equal to t */
bool isSubsetSum(int set[],int n, int t){
    unsigned int w = sizeof(int)*8; //for regular word.
    unsigned int wordsneeded = ceil(double(t+1)/w);
    unsigned int elements = n;
    //Create table
    unsigned int table[elements][wordsneeded];
    int c,i;
    //Initialize first row
    for(i=0;i<wordsneeded;i++){
        table[0][i] = 0;
    }
    table[0][0] = 1<<(w-1);
    //Fill the table in bottom-up manner
    int es,ss,ai;
    for(c=1;c<=elements; c++){
        unsigned int *aux = nbars(table[c-1],set[c-1]);
        for(i=0;i<wordsneeded;i++){
            table[c][i] = table[c-1][i]|aux[i];
        }
    }
    if((table[elements][wordsneeded-1]>>((w*wordsneeded)-t-1))&1 ==1){
        return true;
    }
    return false;
}

int main(){
    int set[] = {81,80,43,40,30,26,12,11,9};
    //int sum = 63;
    int sum = 1000;
    int n = sizeof(set)/sizeof(set[0]);
    if (isSubsetSum(set,n,sum) == true)
        printf("\nFound a subset with given sum\n");
    else
        printf("\nNo subset with given sum\n");
    return 0;
}
OK, so if I run the example with a target sum of 63, it works just fine: it gives the right answer, true, and if I add code to print the subset it prints the right subset. However, if I change the sum to a larger one, say 1000 as in the code, I get the following error:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400af1 in isSubsetSum (set=0x0, n=0, t=0) at redss.cpp:63
63 unsigned int *aux = nbars(table[c-1],set[c-1]);
from gdb. I really don't understand why it would fail only for larger sums, since the process should be the same... am I missing something obvious? Any help would be great!

Efficient Add of Values with Overflow Protection in C/C++

What is the most efficient way to add two scalars in C/C++ with overflow protection? For example, adding two unsigned chars should yield 255 if a+b >= 255.
I have:
unsigned char inline add_o(unsigned char x, unsigned char y)
{
    const short int maxVal = 255;
    unsigned short int s_short = (unsigned short int) x + (unsigned short int) y;
    unsigned char s_char = (s_short <= maxVal) ? (unsigned char)s_short : maxVal;
    return s_char;
}
that can be driven by:
unsigned char x = 200;
unsigned char y = 129;
unsigned char mySum = add_o(x,y);
I see some ideas here but I am interested in the fastest way to perform this operation---or at least one that is highly palatable to an optimizing compiler.
Most modern compilers will generate branch-free code for your current solution, which is already fairly good. A few optimisations, which are very hardware-dependent (x86 in particular), are:
1) replace the comparison with a masked AND
2) try to turn the overflow protection into a conditional move
This is how I would have done it:
unsigned char inline add_o(unsigned char x, unsigned char y) {
    unsigned short int s_short = (unsigned short int) x + (unsigned short int) y;
    if (s_short & 0xFF00) // any bits in the high byte mean the sum exceeded 255
        s_short = 0xFF;
    return s_short;
}
You mean unsigned saturating arithmetic?
unsigned char inline add_o(unsigned char x, unsigned char y) {
    unsigned char s = x + y;
    // If the addition wrapped around, s < x holds, and -(s < x) is an all-ones
    // mask that saturates s to 0xFF; otherwise the mask is 0 and s is unchanged.
    s |= (unsigned)(-(s < x));
    return s;
}
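As an aside (my addition, not part of the original answer): if you need saturating adds over whole buffers rather than single scalars, SSE2 provides a saturating unsigned byte add directly. A minimal sketch, assuming n is a multiple of 16; add_sat_u8 is a hypothetical helper name:
#include <emmintrin.h> // SSE2
#include <stddef.h>

// dst[i] = min(a[i] + b[i], 255) for each byte
void add_sat_u8(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_adds_epu8(va, vb)); // saturating add
    }
}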
The most efficient way is to pre-fill a table with all possible results, then use the addition of x and y to index into that table.
#include <algorithm>
#include <iostream>

// x+y ranges from 0 to 255+255, so the table needs 255+255+1 entries.
unsigned char add_o_results[255+255+1];

void pre_fill() {
    for (int i = 0; i <= 255 + 255; ++i) {
        add_o_results[i] = std::min(i, 255);
    }
}

unsigned char inline add_o(unsigned char x, unsigned char y)
{
    return add_o_results[x+y];
}

using namespace std;

int main()
{
    pre_fill();
    cout << int(add_o(150, 151)) << endl;
    cout << int(add_o(10, 150)) << endl;
    return 0;
}

Printing (factorial of 2^n) / (2^n - 1) mod m in C++

How can we store and print (factorial(2^n) / (2^n - 1)) mod 1000000009 in C++? Here n can be as large as 20. When I try to print this using the following code, it shows a segmentation fault for n=20:
#include <iostream>
#include <cmath>

using namespace std;

long int factorial(int n)
{
    if(n<=1){return 1;}
    else
        return (n%1000000009)*(factorial(n-1))%1000000009;
}
int main()
{
    int K;
    long long int numofmatches=0;
    long long int denominator=0;
    long long int factor=0;
    long long int times=0;
    long long int players=0;
    cin>>K;
    if(K==1)
    {
        cout<<2<<endl<<2<<endl;
        return 0;
    }
    else
    {
        denominator=pow(2,K);
        cout<<"Denominator="<<denominator<<endl;
        numofmatches=factorial(denominator)%1000000009;
        denominator-=1;
        cout<<"numberofmatches="<<numofmatches<<endl;
        cout<<"Denominator="<<denominator<<endl;
        factor=numofmatches/denominator;
        cout<<"Factor="<<factor<<endl;
        while(times<=denominator)
        {
            cout<<(times*factor)<<endl;
            ++times;
        }
    }
    return 0;
}
First of all, note that (2^n)! / (2^n - 1) is equal to (2^n - 2)! × 2^n, since (2^n)! = 2^n × (2^n - 1) × (2^n - 2)!.
Now, (2^20-2)! by itself is already an extremely large number to calculate.
What you can do instead is reduce the intermediate result modulo 1000000009 after every multiplication:
#define MAX ((1<<20)-2)

unsigned long long res = 1;
for (unsigned int i=1; i<=MAX; i++)
    res = (res*i)%1000000009;   // (2^20-2)! mod 1000000009
res = (res*(MAX+2))%1000000009; // multiply by 2^20
If you want to iterate all values of n between 1 and 20, then you can use:
#define MAX_N 20

unsigned int arr[MAX_N+1] = {0};

void Func()
{
    unsigned int i = 1;
    unsigned long long res = 1;
    for (int n=1; n<=MAX_N; n++)
    {
        unsigned int max = (1<<n)-2;
        for (; i<=max; i++)
            res = (res*i)%1000000009;
        arr[n] = (unsigned int)((res*(max+2))%1000000009);
    }
}
BTW, for any n larger than 29 the result will simply be 0, as (2^30-2) is larger than 1000000009.
So (2^30-2)! is divisible by 1000000009, and therefore, (2^30-2)! mod 1000000009 equals 0.
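For completeness, a small usage sketch of my own, assuming it sits in the same file as the Func snippet above:
#include <cstdio>

int main()
{
    Func(); // fills arr[1..MAX_N] with ((2^n)! / (2^n - 1)) mod 1000000009
    for (int n = 1; n <= MAX_N; n++)
        printf("n=%d: %u\n", n, arr[n]);
    return 0;
}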

Using fast Intel random generator (SSE2) fails with "stack around ... is corrupted"

I need a very fast (the fastest) random generator. I found this one from Intel: Fast Intel Random Number Generator.
It looks good, so I created a project in MS Visual Studio 2013:
//FastRandom.h:
#pragma once
#include "emmintrin.h"
#include <time.h>

//define this if you wish to return values similar to the standard rand();
#define COMPATABILITY

namespace Brans
{
    __declspec(align(16)) static __m128i cur_seed;

    // uncomment this if you are using the Intel compiler
    // for MS CL the vectorizer is on by default and jumps in if you
    // compile with /O2 ...
    //#pragma intel optimization_parameter target_arch=avx
    //__declspec(cpu_dispatch(core_2nd_gen_avx, core_i7_sse4_2, core_2_duo_ssse3, generic )
    inline void rand_sse(unsigned int* result)
    {
        __declspec(align(16)) __m128i cur_seed_split;
        __declspec(align(16)) __m128i multiplier;
        __declspec(align(16)) __m128i adder;
        __declspec(align(16)) __m128i mod_mask;
        __declspec(align(16)) __m128i sra_mask;
        __declspec(align(16)) __m128i sseresult;

        __declspec(align(16)) static const unsigned int mult[4] =
            { 214013, 17405, 214013, 69069 };
        __declspec(align(16)) static const unsigned int gadd[4] =
            { 2531011, 10395331, 13737667, 1 };
        __declspec(align(16)) static const unsigned int mask[4] =
            { 0xFFFFFFFF, 0, 0xFFFFFFFF, 0 };
        __declspec(align(16)) static const unsigned int masklo[4] =
            { 0x00007FFF, 0x00007FFF, 0x00007FFF, 0x00007FFF };

        adder = _mm_load_si128((__m128i*) gadd);
        multiplier = _mm_load_si128((__m128i*) mult);
        mod_mask = _mm_load_si128((__m128i*) mask);
        sra_mask = _mm_load_si128((__m128i*) masklo);

        cur_seed_split = _mm_shuffle_epi32(cur_seed, _MM_SHUFFLE(2, 3, 0, 1));
        cur_seed = _mm_mul_epu32(cur_seed, multiplier);
        multiplier = _mm_shuffle_epi32(multiplier, _MM_SHUFFLE(2, 3, 0, 1));
        cur_seed_split = _mm_mul_epu32(cur_seed_split, multiplier);

        cur_seed = _mm_and_si128(cur_seed, mod_mask);
        cur_seed_split = _mm_and_si128(cur_seed_split, mod_mask);
        cur_seed_split = _mm_shuffle_epi32(cur_seed_split, _MM_SHUFFLE(2, 3, 0, 1));
        cur_seed = _mm_or_si128(cur_seed, cur_seed_split);
        cur_seed = _mm_add_epi32(cur_seed, adder);

#ifdef COMPATABILITY
        // Add the lines below if you wish to reduce your results to 16-bit vals...
        sseresult = _mm_srai_epi32(cur_seed, 16);
        sseresult = _mm_and_si128(sseresult, sra_mask);
        _mm_storeu_si128((__m128i*) result, sseresult);
        return;
#endif
        _mm_storeu_si128((__m128i*) result, cur_seed);
        return;
    }

    inline void srand_sse(unsigned int seed)
    {
        cur_seed = _mm_set_epi32(seed, seed + 1, seed, seed + 1);
    }

    inline void srand_sse()
    {
        unsigned int seed = (unsigned int)time(0);
        cur_seed = _mm_set_epi32(seed, seed + 1, seed, seed + 1);
    }

    inline unsigned int GetRandom(unsigned int low, unsigned int high)
    {
        unsigned int ret = 0;
        rand_sse(&ret);
        return ret % (high - low + 1) + low;
    }
};
// Test.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "FastRandom.h"
#include <iostream>

using namespace Brans;

int _tmain(int argc, _TCHAR* argv[])
{
    srand_sse();
    unsigned int result = 0;
    for (size_t i = 0; i < 10000; i++)
    {
        result += GetRandom(1, 50);
        result -= GetRandom(1, 50);
    }
    std::cout << result << std::endl;
    return 0;
}
I expect a result of 0, plus or minus 50. But when I run the program in Debug, I get:
Run-Time Check Failure #2 - Stack around the variable 'ret' was corrupted. at GetRandom(...). When I run it in Release, I get an undefined result, up to the maximum unsigned int. (I am using an Intel i5 processor.)
What is wrong?
=========
Addendum to the accepted answer: I also had a mistake in that I should have used a signed type instead of unsigned int, because a negative result becomes a large positive value for an unsigned type.
From the docs of Intel Fast Random Generator:
The rand_sse() function implements a vectorized version of this fast_rand() function, where the integer math operations are done in fours, using the SIMD architecture.
It means that rand_sse generates 4 random numbers at once using SSE2.
So you need to give it an array of four unsigned ints:
unsigned int result[4];
rand_sse( result );
This instruction:
_mm_storeu_si128((__m128i*) result, cur_seed);
Forcibly casts result, an unsigned int*, to an __m128i* and then writes a 128-bit value there. A single unsigned int cannot accommodate a 128-bit value, so you end up corrupting the stack around the call site, in GetRandom:
unsigned int ret = 0;
rand_sse(&ret);
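Putting both answers together, a corrected GetRandom could look like this (my own sketch): hand rand_sse a buffer large enough for the full 128-bit store and use one of the four lanes:
inline unsigned int GetRandom(unsigned int low, unsigned int high)
{
    unsigned int ret[4]; // rand_sse writes four 32-bit values at once
    rand_sse(ret);
    return ret[0] % (high - low + 1) + low;
}
Since rand_sse stores with _mm_storeu_si128, an unaligned store, the buffer needs no special alignment; a further refinement would be to consume all four lanes instead of discarding three.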