I'm optimizing this function. I have tried every approach I can think of, including SSE, and I modified the code to return from different positions in order to measure where the time goes. In the end I found that most of the time is spent on the boolean test below; even if I replace everything inside the if statement with a single add, it still costs 6000 ms.
My platform is gcc 4.7.1 on an E5506 CPU. The inputs 'a' and 'b' are int arrays of size 1000, and 'asize' and 'bsize' are the corresponding array sizes. MATCH_MASK = 16383. I run the function 100000 times to collect the timings. Is there any good way to attack this problem? Thank you!
if (aoffsets[i] && boffsets[i]) // this line costs most time
Code:
uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
uint8_t* seen = (uint8_t *)aoffsets;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0); // fastest way I found already; very nearly as fast as a thread-local (TLS) buffer
uint8_t* counts = &(count_vec[0]);
//return aoffsets[0]; // cost 1375 ms
for (int i = 0; i < MATCH_MASK; ++i)
{
if (aoffsets[i] && boffsets[i]) // this line costs most time
{
//++aoffsets[i]; // for test
int offset = (aoffsets[i] -= boffsets[i]);
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return aoffsets[0]; // cost 6000ms
First, memset(counts, 0, N) on a memory-aligned buffer that is reused across calls wins over constructing a std::vector each time by about 30%.
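A minimal sketch of that idea (the buffer name and size are mine; it assumes asize + bsize never exceeds 2000, as in the benchmark below, and that the function is not called from multiple threads):

#include <cstdint>
#include <cstring>

static const int kMaxCounts = 2 * 1000 + 1;        // asize + bsize + 1 for the 1000-element case
alignas(16) static uint8_t g_counts[kMaxCounts];   // allocated once, reused on every call

inline uint8_t* cleared_counts(int asize, int bsize)
{
    std::memset(g_counts, 0, asize + bsize + 1);   // clear only the range actually used
    return g_counts;                               // use this in place of &count_vec[0]
}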
You can try the branchless test (aoffsets[i] * boffsets[i]) instead of the short-circuiting &&, and speculatively compute some of the expressions that may end up unused: offset = aoffsets[i] - boffsets[i]; offset + bsize; offset + n_maxoffset;.
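For instance, here is the loop from the question with only the && replaced by the multiply-based test (everything else unchanged; the cast to int is safe here because the stored offsets are array indices below 1000, so the product cannot overflow):

for (int i = 0; i < MATCH_MASK; ++i)
{
    if (int(aoffsets[i]) * int(boffsets[i]) != 0)   // both non-zero: one test instead of two short-circuit branches
    {
        int offset = (aoffsets[i] -= boffsets[i]);
        if ((-n_maxoffset <= offset && offset <= n_maxoffset))
        {
            offset += bsize;
            uint8_t n_cur_count = ++counts[offset];
            if (n_cur_count > topcount)
            {
                topcount = n_cur_count;
                topoffset = offset;
            }
        }
    }
}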
Depending on the typical range of offset, one could be tempted to track the min/max of (offset + bsize) and restrict the memset of count_vec on the next call to that range: there is no need to clear values that are already zero.
As pointed out by philipp, it's good to interleave the operations. Going further, one can read both aoffsets[i] and boffsets[i] simultaneously from a uint32_t aboffset[N] array; with some clever bit masking (generating a change mask for aoffsets[i] and aoffsets[i+1]) one could possibly handle two entries in parallel using 64-bit simulated SIMD (SWAR) in pure C, up to the histogram accumulation part.
You can increase the speed of your program by reducing cache misses: aoffsets[i] and boffsets[i] are relatively far away from each other in memory. By placing them next to each other, you speed up the program significantly. On my machine (E5400 CPU, VS2012) the execution time is reduced from 3.0 seconds to 2.3 seconds:
#include <vector>
#include <windows.h>
#include <iostream>
typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14;
static const int MATCH_LEFT = (32 - MATCH_BITS);
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;
uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
uint16_t offsets[DOUBLE_MATCH_MASK] = {0};
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])*2 /*important. leave space for other offsets*/] = i;
};
fn_init_offsets(a, asize, offsets);
fn_init_offsets(b, bsize, offsets+1);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0);
uint8_t* counts = &(count_vec[0]);
for (int i = 0; i < MATCH_MASK; i+=2)
{
if (offsets[i] && offsets[i+1])
{
int offset = (offsets[i] - offsets[i+1]); //NOTE: I removed the -= because otherwise offset would always be positive!
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return offsets[0];
}
int main(int argc, char* argv[])
{
const int sizes = 1000;
int32_t* a = new int32_t[sizes];
int32_t* b = new int32_t[sizes];
for (int i=0;i<sizes;i++)
{
a[i] = rand()*rand();
b[i] = rand()*rand();
}
//variables for timing
LONGLONG g_Frequency, g_CurentCount, g_LastCount;
QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount);
int sum = 0;
for (int i=0;i<100000;i++)
{
sum += test(a,sizes,b,sizes);
}
QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount);
double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency));
std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl;
delete[] a;
delete[] b;
return 0;
}
Compared to your version of test():
#include <vector>
#include <windows.h>
#include <iostream>
typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;
#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14;
static const int MATCH_LEFT = (32 - MATCH_BITS);
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;
uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
for (int i = 0; i < n_size; ++i)
offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);
uint8_t topcount = 0;
int topoffset = 0;
{
std::vector<uint8_t> count_vec(asize + bsize + 1, 0);
uint8_t* counts = &(count_vec[0]);
for (int i = 0; i < MATCH_MASK; ++i)
{
if (aoffsets[i] && boffsets[i])
{
int offset = (aoffsets[i] - boffsets[i]); //NOTE: I removed the -= because otherwise offset would always be positive!
if ((-n_maxoffset <= offset && offset <= n_maxoffset))
{
offset += bsize;
uint8_t n_cur_count = ++counts[offset];
if (n_cur_count > topcount)
{
topcount = n_cur_count;
topoffset = offset;
}
}
}
}
}
return aoffsets[0];
}
int main(int argc, char* argv[])
{
const int sizes = 1000;
int32_t* a = new int32_t[sizes];
int32_t* b = new int32_t[sizes];
for (int i=0;i<sizes;i++)
{
a[i] = rand()*rand();
b[i] = rand()*rand();
}
LONGLONG g_Frequency, g_CurentCount, g_LastCount;
QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount);
int sum = 0;
for (int i=0;i<100000;i++)
{
sum += test(a,sizes,b,sizes);
}
QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount);
double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency));
std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl;
delete[] a;
delete[] b;
return 0;
}
Related
I have the following problem: I have a hex number (datatype: std::uint64_t) in C++, and the hex number contains all the digits from 1 to a given n. The task is to rotate the first k digits of the hex number, for example:
hex = 0x436512, k = 3 --> 0x634512
I have already tried splitting the hex number into two parts, e.g.:
std::uint64_t left = hex >> ((n - k) * 4);
std::uint64_t right = ((1UL << ((n - k) * 4)) - 1) & hex;
and then rotating left and merging left and right together. Is there a possibility to do this in-place and by only using bit-manipulation and/or mathematical operators?
As a baseline you can use this, which basically converts the number to digits and back.
#include <cstdio>
#include <cstdint>
#include <array>
uint64_t rotate( uint64_t value, int k ) {
// decompose
std::array<uint8_t,16> digits;
int numdigits = 0;
while ( value > 0 ) {
digits[numdigits] = value % 16;
value = value / 16;
numdigits += 1;
}
if ( k>numdigits ) return 0;
// reverse the first (most significant) k digits
int p1 = numdigits - 1;
int p2 = numdigits - k;
for ( ; p1>p2; p1--,p2++ ) {
uint8_t tmp = digits[p1];
digits[p1] = digits[p2];
digits[p2] = tmp;
}
// reconstruct
for ( int j=0; j<numdigits; ++j ) {
value = (value*16) + digits[numdigits-1-j];
}
return value;
}
int main() {
uint64_t value = 0x123ffffff;
for ( int j=0; j<8; ++j ) {
value = value >> 4;
printf( "%lx %lx\n", value, rotate(value,3) );
}
}
Godbolt: https://godbolt.org/z/77qWEo9vE
It produces:
123fffff 321fffff
123ffff 321ffff
123fff 321fff
123ff 321ff
123f 321f
123 321
12 0
1 0
You actually do not need to decompose the entire number; you only need to decompose the leftmost k digits.
#include <cstdio>
#include <cstdint>
#include <array>
uint64_t rotate( uint64_t value, int k ) {
// sanity check
if ( value == 0 ) return 0;
// fast find number of digits
int numdigits = (63-__builtin_clzl(value))/4 + 1;
if ( k>numdigits ) return 0;
// Decompose left and right
int rightbits = 4*(numdigits-k);
uint64_t left = value >> rightbits;
uint64_t right = rightbits==0 ? 0 : value & (uint64_t(-1)>>(64-rightbits));
// decompose left
uint64_t rot = 0;
for ( int j=0; j<k; ++j ) {
uint64_t digit = left % 16;
left = left / 16;
rot = (rot*16) + digit;
}
// rejoin
return right | (rot<<rightbits);
}
int main() {
uint64_t value = 0x123ffffff;
for ( int j=0; j<8; ++j ) {
value = value >> 4;
printf( "%lx %lx\n", value, rotate(value,3) );
}
}
Produces the same output.
Godbolt: https://godbolt.org/z/P3z6W8b3M
Running under Google benchmark:
#include <benchmark/benchmark.h>
#include <vector>
#include <iostream>
uint64_t rotate1(uint64_t value, int k);
uint64_t rotate2(uint64_t value, int k);
struct RotateTrivial {
uint64_t operator()(uint64_t value, int k) {
return rotate1(value, k);
}
};
struct RotateLeftOnly {
uint64_t operator()(uint64_t value, int k) {
return rotate2(value, k);
}
};
template <typename Fn>
static void Benchmark(benchmark::State& state) {
Fn fn;
for (auto _ : state) {
uint64_t value = uint64_t(-1);
for (int j = 0; j < 16; ++j) {
for (int k = 1; k < j; ++k) {
uint64_t result = fn(value, k);
benchmark::DoNotOptimize(result);
}
value = value >> 4;
}
}
}
BENCHMARK(Benchmark<RotateTrivial>);
BENCHMARK(Benchmark<RotateLeftOnly>);
BENCHMARK_MAIN();
On an AMD Threadripper 3960X at 3.5 GHz this produces:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Benchmark<RotateTrivial> 619 ns 619 ns 1158174
Benchmark<RotateLeftOnly> 320 ns 320 ns 2222098
Each benchmark iteration makes 105 calls (the number of (j, k) pairs: 1 + 2 + ... + 14 = 105), so that is about 6.3 ns/call or roughly 20 cycles for the trivial version, and 3.1 ns/call or roughly 10 cycles for the optimized version.
How can I add two unsigned char arrays and store the result in an unsigned short array using SSE? Can anyone give me some help or a hint? This is what I have done so far; I just don't know where the error is.
#include<iostream>
#include<intrin.h>
#include<windows.h>
#include<emmintrin.h>
#include<iterator>
using namespace std;
void sse_add(unsigned char * input1, unsigned char *input2, unsigned short *output, const int N)
{
unsigned char *op3 = new unsigned char[N];
unsigned char *op4 = new unsigned char[N];
__m128i *sse_op3 = (__m128i*)op3;
__m128i *sse_op4 = (__m128i*)op4;
__m128i *sse_result = (__m128i*)output;
for (int i = 0; i < N; i = i + 16)
{
__m128i src = _mm_loadu_si128((__m128i*)input1);
__m128i zero = _mm_setzero_si128();
__m128i higher = _mm_unpackhi_epi8(src, zero);
__m128i lower = _mm_unpacklo_epi8(src, zero);
_mm_storeu_si128(sse_op3, lower);
sse_op3 = sse_op3 + 1;
_mm_storeu_si128(sse_op3, higher);
sse_op3 = sse_op3 + 1;
input1 = input1 + 16;
}
for (int j = 0; j < N; j = j + 16)
{
__m128i src1 = _mm_loadu_si128((__m128i*)input2);
__m128i zero1 = _mm_setzero_si128();
__m128i higher1 = _mm_unpackhi_epi8(src1, zero1);
__m128i lower1 = _mm_unpacklo_epi8(src1, zero1);
_mm_storeu_si128(sse_op4, lower1);
sse_op4 = sse_op4 + 1;
_mm_storeu_si128(sse_op4, higher1);
sse_op4 = sse_op4 + 1;
input2 = input2 + 16;
}
__m128i *sse_op3_new = (__m128i*)op3;
__m128i *sse_op4_new = (__m128i*)op4;
for (int y = 0; y < N; y = y + 8)
{
*sse_result = _mm_adds_epi16(*sse_op3_new, *sse_op4_new);
sse_result = sse_result + 1;
sse_op3_new = sse_op3_new + 1;
sse_op4_new = sse_op4_new + 1;
}
}
void C_add(unsigned char * input1, unsigned char *input2, unsigned short *output, int N)
{
for (int i = 0; i < N; i++)
output[i] = (unsigned short)input1[i] + (unsigned short)input2[i];
}
int main()
{
int n = 1023;
unsigned char *p0 = new unsigned char[n];
unsigned char *p1 = new unsigned char[n];
unsigned short *p21 = new unsigned short[n];
unsigned short *p22 = new unsigned short[n];
for (int j = 0; j < n; j++)
{
p21[j] = rand() % 256;
p22[j] = rand() % 256;
}
C_add(p0, p1, p22, n);
cout << "C_add finished!" << endl;
sse_add(p0, p1, p21, n);
cout << "sse_add finished!" << endl;
for (int j = 0; j < n; j++)
{
if (p21[j] != p22[j])
{
cout << "diff!!!!!#######" << endl;
}
}
//system("pause");
delete[] p0;
delete[] p1;
delete[] p21;
delete[] p22;
return 0;
}
Assuming everything is aligned to _Alignof(__m128i) and the size of the array is a multiple of sizeof(__m128i), something like this should work:
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uint8_t, uint16_t */
#include <smmintrin.h>  /* SSE4.1, for _mm_cvtepu8_epi16 */
void addw(size_t size, uint16_t res[size], uint8_t a[size], uint8_t b[size]) {
__m128i* r = (__m128i*) res;
__m128i* ap = (__m128i*) a;
__m128i* bp = (__m128i*) b;
for (size_t i = 0 ; i < (size / sizeof(__m128i)) ; i++) {
r[(i * 2)] = _mm_add_epi16(_mm_cvtepu8_epi16(ap[i]), _mm_cvtepu8_epi16(bp[i]));
r[(i * 2) + 1] = _mm_add_epi16(_mm_cvtepu8_epi16(_mm_srli_si128(ap[i], 8)), _mm_cvtepu8_epi16(_mm_srli_si128(bp[i], 8)));
}
}
FWIW, NEON would be a bit simpler (using vaddl_u8 and vaddl_high_u8).
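For what it's worth, a minimal sketch of that NEON idea (AArch64 intrinsics; the function name addw_neon and the size-is-a-multiple-of-16 assumption are mine, not from the answer):

#include <stddef.h>
#include <stdint.h>
#include <arm_neon.h>

/* Widening add of two byte arrays into a uint16_t array (size must be a multiple of 16). */
void addw_neon(size_t size, uint16_t *res, const uint8_t *a, const uint8_t *b) {
    for (size_t i = 0; i < size; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u16(res + i,     vaddl_u8(vget_low_u8(va), vget_low_u8(vb)));  /* low 8 lanes  */
        vst1q_u16(res + i + 8, vaddl_high_u8(va, vb));                       /* high 8 lanes */
    }
}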
If you're dealing with unaligned data you can use _mm_loadu_si128/_mm_storeu_si128. If size isn't a multiple of 16 you'll just have to do the remainder without SSE.
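For example, a sketch of an unaligned variant with a scalar tail loop (the name addw_u and the loop structure are illustrative, not taken from the answer above):

#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

void addw_u(size_t size, uint16_t *res, const uint8_t *a, const uint8_t *b) {
    size_t i = 0;
    for (; i + 16 <= size; i += 16) {                        /* 16 input bytes per iteration */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i lo = _mm_add_epi16(_mm_cvtepu8_epi16(va), _mm_cvtepu8_epi16(vb));
        __m128i hi = _mm_add_epi16(_mm_cvtepu8_epi16(_mm_srli_si128(va, 8)),
                                   _mm_cvtepu8_epi16(_mm_srli_si128(vb, 8)));
        _mm_storeu_si128((__m128i *)(res + i), lo);
        _mm_storeu_si128((__m128i *)(res + i + 8), hi);
    }
    for (; i < size; ++i)                                    /* remainder without SSE */
        res[i] = (uint16_t)a[i] + (uint16_t)b[i];
}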
Note that this may be something your compiler can do automatically (I haven't checked). You may want to try something like this:
#pragma omp simd
for (size_t i = 0 ; i < size ; i++) {
res[i] = ((uint16_t) a[i]) + ((uint16_t) b[i]);
}
That uses OpenMP 4, but there is also Cilk++ (#pragma simd), clang (#pragma clang loop vectorize(enable)), gcc (#pragma GCC ivdep), or you could just hope the compiler is smart enough without the pragma hint.
I have two vector<bool> A and B.
I want to compare them and count the number of elements that are equal:
For example:
A = {0,1,0,1}
B = {0,0,1,1}
Result will be equal to 2.
I can use _mm_cmpeq_epi8, but it only compares 16 elements at a time (i.e. I would have to convert the 0s and 1s to char and then do the comparison).
Is it possible to compare 128 elements each time with SSE (or SIMD instructions)?
If you can either assume that vector<bool> is using contiguous byte-sized elements for storage, or if you can consider using something like vector<uint8_t> instead, then this example should give you a good starting point:
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2);
vcount = _mm_sub_epi8(vcount, vcmp);
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
vcount = _mm_setzero_si128(); // update count from current block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
Note that this is using vector<uint8_t>. If you really have to use vector<bool> and can guarantee that the elements will always be contiguous and byte-sized then you'll just need to coerce the vector<bool> into a const uint8_t * or similar somehow.
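If you cannot make that guarantee, a portable (though not free) workaround is to copy each vector<bool> into a vector<uint8_t> first and then reuse count_equal(); a minimal sketch:

std::vector<bool> A = {0,1,0,1};
std::vector<bool> B = {0,0,1,1};
std::vector<uint8_t> a8(A.begin(), A.end());   // each bool becomes a 0/1 byte
std::vector<uint8_t> b8(B.begin(), B.end());
size_t n_equal = count_equal(a8, b8);          // 2 for this example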
Test harness:
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
#include <emmintrin.h> // SSE2
using std::vector;
static size_t count_equal_ref(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size());
const size_t n = vec1.size();
size_t i, count = 0;
for (i = 0 ; i < n; ++i)
{
count += (vec1[i] == vec2[i]);
}
return count;
}
static size_t count_equal(const vector<uint8_t> &vec1, const vector<uint8_t> &vec2)
{
assert(vec1.size() == vec2.size()); // vectors must be same size
const size_t n = vec1.size();
const size_t max_block_size = 255 * 16; // max block size before possible overflow
__m128i vcount = _mm_setzero_si128();
size_t i, count = 0;
for (i = 0; i + 16 <= n; ) // for each block
{
size_t m = std::min(n, i + max_block_size);
for ( ; i + 16 <= m; i += 16) // for each vector in block
{
__m128i v1 = _mm_loadu_si128((__m128i *)&vec1[i]);
__m128i v2 = _mm_loadu_si128((__m128i *)&vec2[i]);
__m128i vcmp = _mm_cmpeq_epi8(v1, v2);
vcount = _mm_sub_epi8(vcount, vcmp);
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
vcount = _mm_setzero_si128(); // update count from current block
}
vcount = _mm_sad_epu8(vcount, _mm_setzero_si128());
count += _mm_extract_epi16(vcount, 0) + _mm_extract_epi16(vcount, 4);
for ( ; i < n; ++i) // deal with any remaining partial vector
{
count += (vec1[i] == vec2[i]);
}
return count;
}
int main(int argc, char * argv[])
{
size_t n = 100;
if (argc > 1)
{
n = atoi(argv[1]);
}
vector<uint8_t> vec1(n);
vector<uint8_t> vec2(n);
srand((unsigned int)time(NULL));
for (size_t i = 0; i < n; ++i)
{
vec1[i] = rand() & 1;
vec2[i] = rand() & 1;
}
size_t n_ref = count_equal_ref(vec1, vec2);
size_t n_test = count_equal(vec1, vec2);
if (n_ref == n_test)
{
std::cout << "PASS" << std::endl;
}
else
{
std::cout << "FAIL: n_ref = " << n_ref << ", n_test = " << n_test << std::endl;
}
return 0;
}
Compile and run:
$ g++ -Wall -msse3 -O3 test.cpp && ./a.out
PASS
std::vector<bool> is a specialization of std::vector for the type bool. Although not required by the C++ standard, in most implementations std::vector<bool> is made space efficient such that each of its elements is a single bit instead of a bool.
The behaviour of std::vector<bool> is similar to that of its primary template counterpart, except that:
std::vector<bool> does not necessarily store its elements contiguously.
In order to expose its elements (i.e., the individual bits), std::vector<bool> uses a proxy class (i.e., std::vector<bool>::reference). Objects of class std::vector<bool>::reference are returned by value from std::vector<bool>'s subscript operator (i.e., operator[]).
Accordingly, I don't think it's portable to use _mm_cmpeq_epi8-like functions, since the storage of a std::vector<bool> is implementation defined (i.e., not guaranteed to be contiguous).
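A tiny illustration of that proxy behaviour (my own example, not from the answer):

#include <vector>
int main()
{
    std::vector<bool> v = {true, false};
    // bool* p = &v[0];   // does not compile: operator[] returns a proxy object, not bool&
    auto ref = v[0];      // std::vector<bool>::reference, not bool
    ref = false;          // writes through the proxy into the packed bit
    return v[0];          // now 0
}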
An alternative but portable way is to use regular STL facilities like the example below:
std::vector<bool> A = {0,1,0,1};
std::vector<bool> B = {0,0,1,1};
std::vector<bool> C(A.size());
std::transform(A.begin(), A.end(), B.begin(), C.begin(), [](bool const &a, bool const &b) { return a == b;});
std::cout << std::count(C.begin(), C.end(), true) << std::endl;
I'm getting started in multithreaded programming so please excuse me if the following seems obvious. I am adding multithreading to an image processing program and the speedup isn't exactly the one I expected.
I'm currently getting a speedup of about 4x on a CPU with 4 physical cores and hyperthreading (8 logical cores), so I'd like to know if this kind of speedup is expected. The only explanation I can think of is that both hyperthreads of a single physical core have to share some sort of memory bus.
Being new to multithreading, it's not entirely clear to me whether this would be considered an I/O-bound program, given that all memory is allocated in RAM (I understand that the virtual memory manager of my OS is the one deciding whether to page this memory in or out of the heap). My machine has 16 GB of RAM, in case it helps to decide whether paging/swapping can be an issue.
I've written a test program showcasing the serial case and two parallel cases, using QThreadPool and tbb::parallel_for.
As you can see, the current program does no real work other than setting a supposed image from black to white; this is done on purpose, to establish a baseline before any real operations are applied to the image.
I'm attaching the program in the hope that someone can tell me whether my quest for a roughly 8x speedup is a lost cause for this kind of processing algorithm. Note that I'm not interested in other kinds of optimizations such as SIMD: my real concern is not just to make it faster, but to make it faster using purely multithreading, without getting into SSE or processor-cache-level optimizations.
#include <iostream>
#include <sys/time.h>
#include <vector>
#include <algorithm>    // std::find, std::min
#include <cstring>      // memset
#include <cstdint>      // uint32_t, uint64_t
#include <cstdlib>      // atoi
#include <QThreadPool>
#include "/usr/local/include/tbb/tbb.h"
#define LOG(x) (std::cout << x << std::endl)
struct col4
{
unsigned char r, g, b, a;
};
class QTileTask : public QRunnable
{
public:
void run()
{
for(uint32_t y = m_yStart; y < m_yEnd; y++)
{
int rowStart = y * m_width;
for(uint32_t x = m_xStart; x < m_xEnd; x++)
{
int index = rowStart + x;
m_pData[index].r = 255;
m_pData[index].g = 255;
m_pData[index].b = 255;
m_pData[index].a = 255;
}
}
}
col4* m_pData;
uint32_t m_xStart;
uint32_t m_yStart;
uint32_t m_xEnd;
uint32_t m_yEnd;
uint32_t m_width;
};
struct TBBTileTask
{
void operator()()
{
for(uint32_t y = m_yStart; y < m_yEnd; y++)
{
int rowStart = y * m_width;
for(uint32_t x = m_xStart; x < m_xEnd; x++)
{
int index = rowStart + x;
m_pData[index].r = 255;
m_pData[index].g = 255;
m_pData[index].b = 255;
m_pData[index].a = 255;
}
}
}
col4* m_pData;
uint32_t m_xStart;
uint32_t m_yStart;
uint32_t m_xEnd;
uint32_t m_yEnd;
uint32_t m_width;
};
struct TBBCaller
{
TBBCaller(std::vector<TBBTileTask>& t)
: m_tasks(t)
{}
TBBCaller(TBBCaller& e, tbb::split)
: m_tasks(e.m_tasks)
{}
void operator()(const tbb::blocked_range<size_t>& r) const
{
for (size_t i=r.begin();i!=r.end();++i)
m_tasks[i]();
}
std::vector<TBBTileTask>& m_tasks;
};
inline double getcurrenttime( void )
{
timeval t;
gettimeofday(&t, NULL);
return static_cast<double>(t.tv_sec)+(static_cast<double>(t.tv_usec) / 1000000.0);
}
char* getCmdOption(char ** begin, char ** end, const std::string & option)
{
char ** itr = std::find(begin, end, option);
if (itr != end && ++itr != end)
{
return *itr;
}
return 0;
}
bool cmdOptionExists(char** begin, char** end, const std::string& option)
{
return std::find(begin, end, option) != end;
}
void baselineSerial(col4* pData, int resolution)
{
double t = getcurrenttime();
for(int y = 0; y < resolution; y++)
{
int rowStart = y * resolution;
for(int x = 0; x < resolution; x++)
{
int index = rowStart + x;
pData[index].r = 255;
pData[index].g = 255;
pData[index].b = 255;
pData[index].a = 255;
}
}
LOG((getcurrenttime() - t) * 1000 << " ms. (Serial)");
}
void baselineParallelQt(col4* pData, int resolution, uint32_t tileSize)
{
double t = getcurrenttime();
QThreadPool pool;
for(int y = 0; y < resolution; y+=tileSize)
{
for(int x = 0; x < resolution; x+=tileSize)
{
uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution);
uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution);
QTileTask* t = new QTileTask;
t->m_pData = pData;
t->m_xStart = x;
t->m_yStart = y;
t->m_xEnd = xEnd;
t->m_yEnd = yEnd;
t->m_width = resolution;
pool.start(t);
}
}
pool.waitForDone();
LOG((getcurrenttime() - t) * 1000 << " ms. (QThreadPool)");
}
void baselineParallelTBB(col4* pData, int resolution, uint32_t tileSize)
{
double t = getcurrenttime();
std::vector<TBBTileTask> tasks;
for(int y = 0; y < resolution; y+=tileSize)
{
for(int x = 0; x < resolution; x+=tileSize)
{
uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution);
uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution);
TBBTileTask t;
t.m_pData = pData;
t.m_xStart = x;
t.m_yStart = y;
t.m_xEnd = xEnd;
t.m_yEnd = yEnd;
t.m_width = resolution;
tasks.push_back(t);
}
}
TBBCaller caller(tasks);
tbb::task_scheduler_init init;
tbb::parallel_for(tbb::blocked_range<size_t>(0, tasks.size()), caller);
LOG((getcurrenttime() - t) * 1000 << " ms. (TBB)");
}
int main(int argc, char** argv)
{
int resolution = 1;
uint32_t tileSize = 64;
char * pResText = getCmdOption(argv, argv + argc, "-r");
if (pResText)
{
resolution = atoi(pResText);
}
char * pTileSizeChr = getCmdOption(argv, argv + argc, "-b");
if (pTileSizeChr)
{
tileSize = atoi(pTileSizeChr);
}
if(resolution > 16)
resolution = 16;
resolution = resolution << 10;
uint32_t tileCount = resolution/tileSize + 1;
tileCount *= tileCount;
LOG("Resolution: " << resolution << " Tile Size: "<< tileSize);
LOG("Tile Count: " << tileCount);
uint64_t pixelCount = resolution*resolution;
col4* pData = new col4[pixelCount];
memset(pData, 0, sizeof(col4)*pixelCount);
baselineSerial(pData, resolution);
memset(pData, 0, sizeof(col4)*pixelCount);
baselineParallelQt(pData, resolution, tileSize);
memset(pData, 0, sizeof(col4)*pixelCount);
baselineParallelTBB(pData, resolution, tileSize);
delete[] pData;
return 0;
}
Yes, a 4x speedup is expected. Hyperthreading is a kind of time sharing implemented in hardware, so you can't expect to benefit from it if one thread is already using all the superscalar pipelines available on the core, as is the case here. The other thread will necessarily have to wait.
You can expect an even lower speedup if your memory bus bandwidth is already saturated by fewer threads than the total number of cores available. That usually happens if you have many cores, as in this question:
Why doesn't this code scale linearly?
Starting from this article - Gallery of Processor Cache Effects by Igor Ostrovsky - I wanted to play with his examples on my own machine.
This is my code for the first example, which looks at how touching different cache lines affects the running time:
#include <iostream>
#include <time.h>
using namespace std;
int main(int argc, char* argv[])
{
int step = 1;
const int length = 64 * 1024 * 1024;
int* arr = new int[length];
timespec t0, t1;
clock_gettime(CLOCK_REALTIME, &t0);
for (int i = 0; i < length; i += step)
arr[i] *= 3;
clock_gettime(CLOCK_REALTIME, &t1);
long int duration = (t1.tv_nsec - t0.tv_nsec);
if (duration < 0)
duration = 1000000000 + duration;
cout<< step << ", " << duration / 1000 << endl;
return 0;
}
Using various values for step, I don't see the jump in the running time:
step, microseconds
1, 451725
2, 334981
3, 287679
4, 261813
5, 254265
6, 246077
16, 215035
32, 207410
64, 202526
128, 197089
256, 195154
I would expect to see something similar to the graph in the article, where the running time stays roughly flat until the step reaches the cache line size (16 ints) and only drops after that.
But from 16 onwards, the running time is halved each time we double the step.
I'm testing it on Ubuntu 13 with a Xeon X5450, compiling with: g++ -O0.
Is something flawed in my code, or are the results actually OK?
Any insight on what I'm missing would be highly appreciated.
Since you want to observe the effect of the cache line size, I recommend cachegrind, part of the valgrind tool suite. Your approach is right, but the wall-clock measurements won't get you close to the results you expect.
#include <iostream>
#include <time.h>
#include <stdlib.h>
using namespace std;
int main(int argc, char* argv[])
{
int step = atoi(argv[1]);
const int length = 64 * 1024 * 1024;
int* arr = new int[length];
for (int i = 0; i < length; i += step)
arr[i] *= 3;
return 0;
}
Run valgrind --tool=cachegrind ./a.out <step> and you should see the cache miss counts. Plot them over the different step values and you will get the desired results with good accuracy. Happy experimenting!
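For example (an illustrative session; I'm assuming the source file is saved as cacheline.cpp, and the cachegrind output file name depends on the process id):

g++ -O0 cacheline.cpp -o a.out
valgrind --tool=cachegrind ./a.out 1     # run with step = 1
valgrind --tool=cachegrind ./a.out 16    # run with step = 16 and compare the D1 miss counts
cg_annotate cachegrind.out.<pid>         # optional: per-function breakdown of the recorded misses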
public class CacheLine {
public static void main(String[] args) {
CacheLine cacheLine = new CacheLine();
cacheLine.startTesting();
}
private void startTesting() {
byte[] array = new byte[128 * 1024];
for (int testIndex = 0; testIndex < 10; testIndex++) {
testMethod(array);
System.out.println("--------- // ---------");
}
}
private void testMethod(byte[] array) {
for (int len = 8192; len <= array.length; len += 8192) {
long t0 = System.nanoTime();
for (int i = 0; i < 10000; i++) {
for (int k = 0; k < len; k += 64) {
array[k] = 1;
}
}
long dT = System.nanoTime() - t0;
System.out.println("len: " + len / 1024 + " dT: " + dT + " dT/stepCount: " + (dT) / len);
}
}
}
This code helps you determine the L1 data cache size: the normalized time (dT / stepCount) should jump once len grows past the size of L1. You can read about it in more detail here: https://medium.com/#behzodbekqodirov/threading-in-java-194b7db6c1de#.kzt4w8eul