Is cvCeil() faster than the standard library? - c++

I see that OpenCV implements the cvCeil function:
CV_INLINE int cvCeil( double value )
{
#if defined _MSC_VER && defined _M_X64 || (defined __GNUC__ && defined __SSE2__ && !defined __APPLE__)
    __m128d t = _mm_set_sd( value );
    int i = _mm_cvtsd_si32(t);
    return i + _mm_movemask_pd(_mm_cmplt_sd(_mm_cvtsi32_sd(t,i), t));
#elif defined __GNUC__
    int i = (int)value;
    return i + (i < value);
#else
    int i = cvRound(value);
    float diff = (float)(i - value);
    return i + (diff < 0);
#endif
}
I'm curious about the first part of this implementation, i.e. the _mm_set_sd-related calls. Will they be faster than the MSVCRT / libstdc++ / libc++ implementations? And why?

A simple benchmark below tells me that std::ceil works more than 3 times faster on my SSE4-enabled machine, but is about 2 times slower when SSE4 is not enabled.
#include <cmath>
#include <chrono>
#include <sstream>
#include <iostream>
#include <opencv2/core/fast_math.hpp>

auto currentTime() { return std::chrono::steady_clock::now(); }

template<typename T, typename P>
std::string toString(std::chrono::duration<T,P> dt)
{
    std::ostringstream str;
    using namespace std::chrono;
    str << duration_cast<microseconds>(dt).count()*1e-3 << " ms";
    return str.str();
}

int main()
{
    volatile double x=34.234;
    volatile double y;
    constexpr auto MAX_ITER=100'000'000;
    const auto t0=currentTime();
    for(int i=0;i<MAX_ITER;++i)
        y=std::ceil(x);
    const auto t1=currentTime();
    for(int i=0;i<MAX_ITER;++i)
        y=cvCeil(x);
    const auto t2=currentTime();
    std::cout << "std::ceil: " << toString(t1-t0) << "\n"
                 "cvCeil : " << toString(t2-t1) << "\n";
}
I tested with the -O3 option on GCC 8.3.0, glibc-2.27, Ubuntu 18.04.1 x86_64, on an Intel Core i7-3930K at 3.2 GHz.
Output when compiled with -msse4:
std::ceil: 39.357 ms
cvCeil : 143.224 ms
Output when compiled without -msse4:
std::ceil: 274.945 ms
cvCeil : 146.218 ms
It's easy to understand: SSE4.1 introduced the ROUNDSD instruction, which is essentially what std::ceil compiles to. Before that, the compiler had to use comparison/conditional-move tricks, and it also had to make sure that these don't overflow. Thus the cvCeil version, sacrificing well-definedness for value > INT_MAX and for value < INT_MIN, gets a speedup for the values for which it is well-defined. For the others it has undefined behavior (or, with the intrinsics, simply gives wrong results).

Related

How to write and call std::hash? - for gmp's mpz_class and mpz_t

I think most of the work is done here, just a little detail is missing at the end. Read on.
I'm trying to write the glue code for using MurmurHash3 to hash big integers (mpz_t and mpz_class) of the GMP library in C++. I do this in order to later use them in a std::unordered_map<mpz_class, int>.
I want the code to compile in a useful way for 32-bit and 64-bit systems, and to be easily extensible when 128-bit systems are required. Therefore I've written the MurmurHash3_size_t() function, which calls the right hash function of MurmurHash3 and then converts the result to size_t. I assume that size_t has the correct bit size for 32/64/128-bit systems. (I don't know if this assumption is useful.) This part of the code compiles nicely.
The problem arises when I want to define the std::hash function. I get a compiler error for my code (see the comment in the code). How do I write these std::hash functions correctly, and how do I call them?
File hash_mpz.cpp:
#include "hash_mpz.h"
#include <gmpxx.h>
#include "MurmurHash3.h"

size_t MurmurHash3_size_t(const void *key, int len, uint32_t seed) {
#if SIZE_MAX == 0xffffffff
    size_t result;
    MurmurHash3_x86_32(key, len, seed, &result);
    return result;
#elif SIZE_MAX == 0xffffffffffffffff
    size_t result[2];
    MurmurHash3_x64_128(key, len, seed, &result);
    return result[0] ^ result[1];
#else
#error cannot determine correct version of MurmurHash3, because SIZE_MAX is neither 0xffffffff nor 0xffffffffffffffff
#endif
}

namespace std {
size_t hash<mpz_t>::operator()(const mpz_t &x) const {
    // found 1846872219 by randomly hitting digits on my keyboard
    return MurmurHash3_size_t(x->_mp_d, x->_mp_size * sizeof(mp_limb_t), 1846872219);
}
size_t hash<mpz_class>::operator()(const mpz_class &x) const {
    // compiler error in next statement
    // error: no matching function for call to ‘std::hash<__mpz_struct [1]>::operator()(mpz_srcptr)’
    return hash<mpz_t>::operator()(x.get_mpz_t());
}
}
Found a solution which works for me:
namespace std {
size_t hash<mpz_srcptr>::operator()(const mpz_srcptr x) const {
    // found 1846872219 by randomly typing digits on my keyboard
    return MurmurHash3_size_t(x->_mp_d, x->_mp_size * sizeof(mp_limb_t),
                              1846872219);
}
size_t hash<mpz_t>::operator()(const mpz_t &x) const {
    return hash<mpz_srcptr> { }((mpz_srcptr) x);
}
size_t hash<mpz_class>::operator()(const mpz_class &x) const {
    return hash<mpz_srcptr> { }(x.get_mpz_t());
}
}
Then you can use the hash function as follows:
#include <iostream>
#include <gmpxx.h>
#include <unordered_map>
#include "hash_mpz.h"

using namespace std;

int main() {
    mpz_class a;
    mpz_ui_pow_ui(a.get_mpz_t(), 168, 16);
    cout << "a : " << a << endl;
    cout << "hash(a): " << (hash<mpz_class> { }(a)) << endl;
    unordered_map<mpz_class, int> map;
    map[a] = 2;
    cout << "map[a] : " << map[a] << endl;
    return 0;
}
Output:
a : 402669288768856477614113920779288576
hash(a): 11740158581999522595
map[a] : 2
Comments are appreciated.

Optimal branchless conditional selection of two SSE2 packed doubles

I'm trying to write a branchless bit select function for packed SSE2 doubles:
#include <iostream>
#include <cstdlib>      // EXIT_SUCCESS
#include <emmintrin.h>

inline __m128d select(bool expression, const __m128d& x, const __m128d& y)
{
    const int conditional_mask = expression ? -1 : 0;
    const auto mask = _mm_castsi128_pd(_mm_set_epi64x(conditional_mask, conditional_mask));
    return _mm_or_pd(_mm_and_pd(mask, x), _mm_andnot_pd(mask, y));
}

int main()
{
    auto r1 = _mm_setr_pd(1, 2);
    auto r2 = _mm_setr_pd(5, 6);
    auto result = select(true, r1, r2);
    auto packed = reinterpret_cast<double*>(&result);
    std::cout << "result = " << packed[0] << ", " << packed[1] << std::endl;
    std::getchar();
    return EXIT_SUCCESS;
}
Is there a simpler approach for SSE2 and SSE4 that would be more optimal on x64?
You've specified that SSE4 is allowed: SSE4.1 has blendvpd, so you can blend with a built-in blend (not tested, but it compiles):
inline __m128d select(bool expression, const __m128d& x, const __m128d& y)
{
    const int c_mask = expression ? -1 : 0;
    const auto mask = _mm_castsi128_pd(_mm_set_epi64x(c_mask, c_mask));
    return _mm_blendv_pd(y, x, mask);
}
I would also not take SSE vectors as arguments by reference: copying them is trivial, so it is not something to be avoided, and taking them by reference encourages the compiler to bounce them through memory (for non-inlined calls).

CUDA Thrust Functor with Flexibility to Run in CPU or GPU

This might be a stupid question, but I cannot seem to be able to find any resources specifically related to it, so your opinion is appreciated.
Let's say I have some functor
struct AddOne {
    thrust::device_ptr<int> numbers;
    __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};
which I can call with:
AddOne addOneFunctor;
thrust::device_vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::device, idx.begin(), idx.end(), addOneFunctor);
Is it possible to write the above so that the execution policy can be changed at either compile time or, ideally, run time?
E.g. change the struct to
struct AddOne {
    thrust::pointer<int> numbers;
    __host__ __device__
    void operator()(int i) {
        numbers[i] = numbers[i] + 1;
    }
};
so that it can be run with something like:
AddOne addOneFunctor;
std::vector<int> idx(100), numbers(100);
addOneFunctor.numbers = numbers.data();
thrust::sequence(idx.begin(), idx.end(), 0);
thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);
The bottom line is: I would like to have a single code-base where I can decide to either use thrust::device_vectors or some sort of host vector (such as std::vectors) and run these in the GPU (using thrust::device exec policy) or CPU (using thrust::cpp::par or similar policy) respectively.
PS: I would like to avoid PGI for now.
Yes, it's possible, pretty much exactly as you describe.
Here's a fully-worked example:
$ cat t1205.cu
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <iostream>
#include <vector>

struct AddOne {
    int *numbers;
    template <typename T>
    __host__ __device__
    void operator()(T &i) {
        numbers[i] = numbers[i] + 1;
    }
};

int main(){
    AddOne addOneFunctor;
    std::vector<int> idx(100), numbers(100);
    addOneFunctor.numbers = thrust::raw_pointer_cast(numbers.data());
    thrust::sequence(idx.begin(), idx.end(), 0);
    thrust::for_each(thrust::cpp::par, idx.begin(), idx.end(), addOneFunctor);
    for (int i = 0; i < 5; i++)
        std::cout << numbers[i] << ",";
    std::cout << std::endl;

    thrust::device_vector<int> didx(100), dnumbers(100);
    addOneFunctor.numbers = thrust::raw_pointer_cast(dnumbers.data());
    thrust::sequence(didx.begin(), didx.end(), 0);
    thrust::for_each(thrust::device, didx.begin(), didx.end(), addOneFunctor);
    for (int i = 0; i < 5; i++)
        std::cout << dnumbers[i] << ",";
    std::cout << std::endl;
}
$ nvcc -o t1205 t1205.cu
$ ./t1205
1,1,1,1,1,
1,1,1,1,1,
$
Note that the algorithm is thrust::sequence, not thrust::seq.
Using CUDA 8RC
As #m.s. points out, the explicit use of execution policies on the algorithms in the code above is not necessary - you can remove them and it will work the same way. However, the formal usage of an execution policy allows the above example to be extended to the case where you are not using containers, but ordinary host and device data, so it may still have some value, depending on your overall goals.
Would this fit your requirement?

Always use thrust::device_vector to run on the device;
Define different macros at compile time to select the device backend: GPU (CUDA) or CPU (OpenMP/TBB/CPP).
More info here:
https://github.com/thrust/thrust/wiki/Device-Backends
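The macro switch mentioned above happens on the compiler command line. The build lines below are an illustrative sketch (paths and optimization flags are assumptions; see the linked wiki page for the authoritative details):

```shell
# GPU backend: CUDA is the default device system under nvcc
nvcc -O2 t1205.cu -o t1205_cuda

# CPU backend: same source file, device system switched to OpenMP
g++ -O2 -x c++ t1205.cu \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
    -fopenmp -I/path/to/thrust -o t1205_omp
```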

Handle std::thread::hardware_concurrency()

In my question about std::thread, I was advised to use std::thread::hardware_concurrency(). I read somewhere (which I cannot find now; it seemed like a local repository of code or something) that this feature is not implemented for versions of g++ prior to 4.8.
As a matter of fact, I was in the same position as this user: the function simply returns 0. I found a user implementation in this answer. Comments on whether this answer is good or not are welcome!
So I would like to do this in my code:
unsigned int cores_n;
#if g++ version < 4.8   // pseudocode
    cores_n = my_hardware_concurrency();
#else
    cores_n = std::thread::hardware_concurrency();
#endif
However, I could not find a way to achieve this result. What should I do?
There is an alternative to using the GCC Common Predefined Macros: check whether std::thread::hardware_concurrency() returns zero, meaning the feature is not (yet) implemented.
unsigned int hardware_concurrency()
{
    unsigned int cores = std::thread::hardware_concurrency();
    return cores ? cores : my_hardware_concurrency();
}
You may be inspired by awgn's source code (GPL v2 licensed) to implement my_hardware_concurrency()
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

auto my_hardware_concurrency()
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    return std::count(std::istream_iterator<std::string>(cpuinfo),
                      std::istream_iterator<std::string>(),
                      std::string("processor"));
}
Based on the common predefined macros link, kindly provided by Joachim, I did:
int p;
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) // GCC 4.8 or later
    const int P = std::thread::hardware_concurrency();
#else
    const int P = my_hardware_concurrency();
#endif
p = (trees_no < P) ? trees_no : P;
std::cout << P << " concurrent threads are supported.\n";

Multi-threaded performance std::string

We are running some code on a project that uses OpenMP and I've run into something strange. I've included parts of some play code that demonstrates what I see.
The tests compare calling a function with a const char* argument with a std::string argument in a multi-threaded loop. The functions essentially do nothing and so have no overhead.
What I do see is a major difference in the time it takes to complete the loops. For the const char* version doing 100,000,000 iterations the code takes 0.075 seconds to complete compared with 5.08 seconds for the std::string version. These tests were done on Ubuntu-10.04-x64 with gcc-4.4.
My question is basically whether this is solely due to the dynamic allocation of std::string, and why in this case that can't be optimized away, since it is const and can't change?
Code below and many thanks for your responses.
Compiled with: g++ -Wall -Wextra -O3 -fopenmp string_args.cpp -o string_args
#include <iostream>
#include <map>
#include <string>
#include <stdint.h>

// For wall time
#ifdef _WIN32
#include <time.h>
#else
#include <sys/time.h>
#endif

namespace
{
    const int64_t g_max_iter = 100000000;

    std::map<const char*, int> g_charIndex = std::map<const char*,int>();
    std::map<std::string, int> g_strIndex = std::map<std::string,int>();

    class Timer
    {
    public:
        Timer()
        {
#ifdef _WIN32
            m_start = clock();
#else /* linux & mac */
            gettimeofday(&m_start,0);
#endif
        }

        float elapsed()
        {
#ifdef _WIN32
            clock_t now = clock();
            const float retval = float(now - m_start)/CLOCKS_PER_SEC;
            m_start = now;
#else /* linux & mac */
            timeval now;
            gettimeofday(&now,0);
            const float retval = float(now.tv_sec - m_start.tv_sec) + float((now.tv_usec - m_start.tv_usec)/1E6);
            m_start = now;
#endif
            return retval;
        }

    private:
        // The type of this variable is different depending on the platform
#ifdef _WIN32
        clock_t
#else
        timeval
#endif
        m_start; ///< The starting time (implementation dependent format)
    };
}

bool contains_char(const char * id)
{
    if( g_charIndex.empty() ) return false;
    return (g_charIndex.find(id) != g_charIndex.end());
}

bool contains_str(const std::string & name)
{
    if( g_strIndex.empty() ) return false;
    return (g_strIndex.find(name) != g_strIndex.end());
}

void do_serial_char()
{
    int found(0);
    Timer clock;
    for( int64_t i = 0; i < g_max_iter; ++i )
    {
        if( contains_char("pos") )
        {
            ++found;
        }
    }
    std::cout << "Loop time: " << clock.elapsed() << "\n";
    ++found;
}

void do_parallel_char()
{
    int found(0);
    Timer clock;
    #pragma omp parallel for
    for( int64_t i = 0; i < g_max_iter; ++i )
    {
        if( contains_char("pos") )
        {
            ++found;
        }
    }
    std::cout << "Loop time: " << clock.elapsed() << "\n";
    ++found;
}

void do_serial_str()
{
    int found(0);
    Timer clock;
    for( int64_t i = 0; i < g_max_iter; ++i )
    {
        if( contains_str("pos") )
        {
            ++found;
        }
    }
    std::cout << "Loop time: " << clock.elapsed() << "\n";
    ++found;
}

void do_parallel_str()
{
    int found(0);
    Timer clock;
    #pragma omp parallel for
    for( int64_t i = 0; i < g_max_iter ; ++i )
    {
        if( contains_str("pos") )
        {
            ++found;
        }
    }
    std::cout << "Loop time: " << clock.elapsed() << "\n";
    ++found;
}

int main()
{
    std::cout << "Starting single-threaded loop using std::string\n";
    do_serial_str();
    std::cout << "\nStarting multi-threaded loop using std::string\n";
    do_parallel_str();
    std::cout << "\nStarting single-threaded loop using char *\n";
    do_serial_char();
    std::cout << "\nStarting multi-threaded loop using const char*\n";
    do_parallel_char();
}
My question is basically whether this is solely due to the dynamic allocation of std::string, and why in this case that can't be optimized away since it is const and can't change?
Yes, it is due to the allocation and copying for std::string on every iteration.
A sufficiently smart compiler could potentially optimize this, but it is unlikely to happen with current optimizers. Instead, you can hoist the string yourself:
void do_parallel_str()
{
    int found(0);
    Timer clock;
    std::string const str = "pos"; // you can even make it static, if desired
    #pragma omp parallel for
    for( int64_t i = 0; i < g_max_iter; ++i )
    {
        if( contains_str(str) )
        {
            ++found;
        }
    }
    //clock.stop(); // Or use something to that effect, so you don't include
                    // anything below (such as outputting "Loop time: ") in the timing.
    std::cout << "Loop time: " << clock.elapsed() << "\n";
    ++found;
}
Does changing:
if( contains_str("pos") )
to:
static const std::string str = "pos";
if( contains_str(str) )
change things much? My current best guess is that the implicit std::string constructor call on every iteration introduces a fair bit of overhead, and optimising it away, whilst possible, is still a sufficiently hard problem, I suspect.
std::string (in your case a temporary) requires dynamic allocation, which is a very slow operation compared to everything else in your loop. There are also old implementations of the standard library that did copy-on-write (COW), which is also slow in a multi-threaded environment. Having said that, there is no reason why the compiler cannot optimize away the temporary string creation and the whole contains_str call, unless you have some side effects there. Since you didn't provide the implementation of that function, it's impossible to say whether it could be completely optimized away.