Performance penalty of using boost::irange over raw loop - C++

Several answers, discussions, and even the source code of boost::irange mention that there should be a performance penalty to using these ranges over raw for loops.
However, for example for the following code
#include <boost/range/irange.hpp>

int sum(int* v, int n) {
    int result{};
    for (auto i : boost::irange(0, n)) {
        result += v[i];
    }
    return result;
}

int sum2(int* v, int n) {
    int result{};
    for (int i = 0; i < n; ++i) {
        result += v[i];
    }
    return result;
}
I see no differences in the generated (-O3 optimized) code (Compiler Explorer). Does anyone see an example where using such an integer range could lead to worse code generation in modern compilers?
EDIT: Clearly, debug performance might be impacted, but that's not my concern here.
Concerning the strided (step size > 1) example, I think it might be possible to modify the irange code to more closely match the code of a strided raw for-loop.

Does anyone see an example where using such an integer range could lead to worse code generation in modern compilers?
Yes, although that is not to say your particular case is affected. But change the step to anything other than 1:
#include <boost/range/irange.hpp>

int sum(int* v, int n) {
    int result{};
    for (auto i : boost::irange(0, n, 8)) { // ^^^ different step
        result += v[i];
    }
    return result;
}

int sum2(int* v, int n) {
    int result{};
    for (int i = 0; i < n; i += 8) { // ^^^ different step
        result += v[i];
    }
    return result;
}
Live.
While sum now looks worse (the loop did not get unrolled), sum2 still benefits from loop unrolling and SIMD optimization.
Edit:
To comment on your edit: it's true that it might be possible to modify the irange code to match a strided raw loop more closely. But:
To fit how range-based for loops are expanded, boost::irange(0, n, 8) must create some sort of temporary implementing begin/end iterators and a prefix operator++ (which is clearly not as trivial as an int += operation). Compilers optimize by pattern matching, and those patterns are tuned to standard C++ and the standard libraries. So if the code resulting from irange differs even slightly from a pattern the compiler knows how to optimize, the optimization won't kick in. I think these are the reasons why the author of the library mentions performance penalties.
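For illustration, here is a minimal sketch (hypothetical names, not Boost's actual implementation) of what a strided counting range has to provide so that a range-based for can consume it. Note that operator++ updates a member of an iterator object rather than doing a plain i += 8 on a local variable, and that the stride forces the end test to be an ordering comparison:

struct strided_range {
    struct iterator {
        int value, step;
        int operator*() const { return value; }
        iterator& operator++() { value += step; return *this; }
        // With step > 1 the value can overshoot `last`, so `!=` must really be
        // an ordering comparison (real implementations adjust the end instead).
        bool operator!=(const iterator& rhs) const { return value < rhs.value; }
    };
    int first, last, step;
    iterator begin() const { return {first, step}; }
    iterator end() const { return {last, step}; }
};

int sum_strided(int* v, int n) {
    int result{};
    for (int i : strided_range{0, n, 8})  // expands to begin()/end()/operator*/operator++
        result += v[i];
    return result;
}

Whether the optimizer collapses this to the same code as the raw i += 8 loop depends on how well it sees through the iterator abstraction, which is exactly the fragility described above.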

Related

Why is vector<vector<int>> slower than vector<int> []?

I was trying to solve LeetCode 323. My code and my way of solving the problem were basically identical to the official answer. The only difference was that I was using vector<vector<int>> while the official answer used vector<int>[] to keep the neighbors of each node. When I used vector<int>[], the system accepted my answer. Are there any advantages of using vector<int>[] over using vector<vector<int>>? I put my code and the official solution code below. Thank you so much in advance.
My code:
class Solution {
public:
    void explore(vector<bool>& visited, vector<vector<int>>& nei, int cur) {
        visited[cur] = true;
        for (int i = 0; i < nei[cur].size(); i++) {
            if (!visited[nei[cur][i]]) {
                explore(visited, nei, nei[cur][i]);
            }
        }
    }

    int countComponents(int n, vector<vector<int>>& edges) {
        vector<bool> visited(n);
        vector<vector<int>> neighbors(n);
        int count = 0;
        for (int i = 0; i < edges.size(); i++) {
            neighbors[edges[i][0]].push_back(edges[i][1]);
            neighbors[edges[i][1]].push_back(edges[i][0]);
        }
        for (int j = 0; j < n; j++) {
            if (!visited[j]) {
                count++;
                explore(visited, neighbors, j);
            }
        }
        return count;
    }
};
Official solution
class Solution {
public:
    void dfs(vector<int> adjList[], vector<int>& visited, int src) {
        visited[src] = 1;
        for (int i = 0; i < adjList[src].size(); i++) {
            if (visited[adjList[src][i]] == 0) {
                dfs(adjList, visited, adjList[src][i]);
            }
        }
    }

    int countComponents(int n, vector<vector<int>>& edges) {
        if (n == 0) return 0;
        int components = 0;
        vector<int> visited(n, 0);
        vector<int> adjList[n];
        for (int i = 0; i < edges.size(); i++) {
            adjList[edges[i][0]].push_back(edges[i][1]);
            adjList[edges[i][1]].push_back(edges[i][0]);
        }
        for (int i = 0; i < n; i++) {
            if (visited[i] == 0) {
                components++;
                dfs(adjList, visited, i);
            }
        }
        return components;
    }
};
I'm not sure, but I think the main problem with your solution is std::vector<bool>, which is a special case of std::vector.
In the '90s, memory size was a concern, so to save memory std::vector<bool> was made a specialization of the std::vector template that stores each bool value in a single bit.
This compacts memory but comes with a performance penalty, and it now has to remain this way forever to stay compatible with existing code.
I would recommend replacing std::vector<bool> with std::vector<char> and changing nothing else; let the implicit conversion between bool and char do the magic.
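In terms of the original code, that change is essentially one declaration (plus the matching parameter type in explore); a sketch:

// was: vector<bool> visited(n);   // bit-packed specialization
vector<char> visited(n);           // one byte per flag; bool converts implicitly
// explore's parameter becomes vector<char>& visited to match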
A second candidate is the missing reserve for adjList[i], as mentioned in the other answer, but the "official" solution doesn't do that either.
Here I refactor your code.
The only difference was that I was using vector<vector<int>>
There are several differences:
The official solution uses a (non-standard C++) VLA, whereas you use a compliant vector<vector<int>>.
I would say that VLA "allocation" (similar to alloca) is faster than the real allocation done by std::vector (new[]).
From your test, assuming the timing is done correctly, the VLA seems to have a real impact.
The official solution uses vector<int>, whereas you use std::vector<bool>.
Due to its specialization, vector<bool> is more compact than std::vector<int /*or char*/>, but it requires a little more work to set or retrieve an individual value.
You use some different names.
Naming differences should not impact runtime.
In some circumstances, very long names and heavy template usage might impact compilation time, but that should not be the case here.
The order of the parameters of dfs/explore differs.
That might allow micro-optimizations in some cases, but swapping the two vectors doesn't seem relevant here.
Are there any advantages of using vector<int>[] over using vector<vector<int>>?
VLA is non-standard C++; that is a big disadvantage.
The stack is generally more limited than the heap, so the size of the "array" is more limited.
Its advantage seems to be faster allocation, as illustrated below.
The usage speed should be similar, though.
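To make the allocation difference concrete, the two declarations side by side (a sketch; the VLA form compiles only with compiler extensions such as GCC's):

// Non-standard VLA (GCC extension): storage comes from the stack, so the
// "allocation" is roughly a stack-pointer bump, similar to alloca.
vector<int> adjList[n];

// Standard C++: one heap allocation (new[]) managed by the outer vector.
vector<vector<int>> adjList2(n);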

GCC, std::array and for vs range-based for

I'm trying to use (large-ish) arrays of structs containing only a single small std::array each.
The K == 1 case in particular should be well supported (see the code).
However, GCC seems unable to properly optimize these arrays in most cases, especially when using range-based for.
Clang produces code that I don't fully understand but seems to be well optimized (and uses SSE, which GCC doesn't).
#include <vector>
#include <array>

template<int K>
struct Foo {
    std::array<int, K> vals;

    int bar() const {
        int sum = 0;
#if 0 // Foo Version A
        for (auto f : vals)
            sum += f;
#else // Foo Version B
        for (auto i = 0; i < K; ++i)
            sum += vals[i];
#endif
        return sum;
    }
};

int test1(std::vector<Foo<1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        sum += f.bar();
    return sum;
}

// Version C
int test2(std::vector<std::array<int, 1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        for (auto const& v : f)
            sum += v;
    return sum;
}

// Version D
int test3(std::vector<std::array<int, 1>> const& foos)
{
    int sum = 0;
    for (auto const& f : foos)
        for (auto i = 0; i < f.size(); ++i)
            sum += f[i];
    return sum;
}
Godbolt Code, gcc 7.2, flags -O2 -std=c++11 -march=native. Older gcc versions behave similarly.
If I'm not mistaken, then all four versions should have the same semantics.
Furthermore, I would expect all versions to compile to about the same assembly.
The assembly should only have one frequently used conditional jump (for iterating over the vector).
However, the following happens:
Version A (range-based for, array-in-struct): 3 conditional jumps, one at the beginning for handling zero-length vectors. Then one for the vector (this is OK). But then another for iterating over the array? Why? It has constant size 1.
Version B (manual for, array-in-struct): Here, GCC actually recognizes that the array of length 1 can be optimized, the assembly looks good.
Version C (range-based for, direct array): The loop over the array is not optimized away, so again two conditional jumps for the looping. Also: this version contains more memory accesses than I thought would be required.
Version D (manual for, direct array): This one is the only version that looks sane to me. 11 instructions.
Clang creates way more assembly code (same flags) but it's pretty much the same for all versions and contains a lot of loop unrolling and SSE.
Is this a GCC-related problem? Should I file a bug? Is this something in my C++ code that I should/can fix?
EDIT: updated the godbolt url due to a fix in Version B. Behavior is now the same as Version D, which makes this a pure manual-for vs. range-for issue.

Why does this C++ function take 4 times as long as a C function?

I am considering using C++ for a performance-critical application. I thought C and C++ would have comparable running times; however, I see that the C++ function takes more than 4 times as long to run as the comparable C snippet.
When I disassembled the code, I saw that end(), ++, and != were all implemented as function calls. Is it possible to make them (at least some of them) inline?
Here is the C++ code:
#include <list>

typedef struct pfx_s {
    unsigned int start;
    unsigned int end;
    unsigned int count;
} pfx_t;

typedef std::list<pfx_t *> pfx_list_t;
typedef pfx_list_t::const_iterator const_list_iter_t;

int
eval_one_pkt (pfx_list_t *cfg, unsigned int ip_addr)
{
    const_list_iter_t iter;

    for (iter = cfg->begin(); iter != cfg->end(); iter++) {
        if (((*iter)->start <= ip_addr) &&
            ((*iter)->end >= ip_addr)) {
            (*iter)->count++;
            return 1;
        }
    }
    return 0;
}
And this is the equivalent C code:
int
eval_one_pkt (cfg_t *cfg, unsigned int ip_addr)
{
    pfx_t *pfx;

    TAILQ_FOREACH (pfx, &cfg->pfx_head, next) {
        if ((pfx->start <= ip_addr) &&
            (pfx->end >= ip_addr)) {
            pfx->count++;
            return 1;
        }
    }
    return 0;
}
It might be worth noting that the data structures you used are not entirely equivalent. Your C list is implemented as a list of immediate elements. Your C++ list is implemented as a list of pointers to actual elements. Why did you make your C++ list a list of pointers?
This alone will not, of course, cause a four-fold difference in performance. However, it could affect the code's performance due to its worse memory locality.
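Concretely, the two layouts being compared (the typedef names are illustrative):

// Elements stored in the list nodes themselves: one allocation per node.
typedef std::list<pfx_t>   pfx_obj_list_t;

// Nodes store pointers to separately allocated elements: a second allocation
// per element and an extra indirection on every access, hence worse locality.
typedef std::list<pfx_t *> pfx_ptr_list_t;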
I would guess that you timed a debug version of your code, maybe even compiled with the debug version of the library.
Do you have a really good reason to use a list here at all? At first glance, it looks like a std::vector will be a better choice. You probably also don't want a container of pointers, just a container of objects.
You can also do the job quite a bit more neatly with a standard algorithm:
typedef std::vector<pfx_t> pfx_list_t;

int
eval_one_pkt(pfx_list_t &cfg, unsigned int ip_addr) {
    auto pos = std::find_if(cfg.begin(), cfg.end(),
                            [ip_addr](pfx_t const &p) {
                                return ip_addr >= p.start && ip_addr <= p.end;
                            });
    if (pos != cfg.end()) {
        ++(pos->count);
        return 1;
    }
    return 0;
}
If I were doing it, however, I'd probably turn that into a generic algorithm instead:
template <class InIter>
int
eval_one_pkt(InIter b, InIter e, unsigned int ip_addr) {
    auto pos = std::find_if(b, e,
                            [ip_addr](pfx_t const &p) {
                                return ip_addr >= p.start && ip_addr <= p.end;
                            });
    if (pos != e) {
        ++(pos->count);
        return 1;
    }
    return 0;
}
Though unrelated to C vs. C++, for a possible slight further optimization of the range check you might want to try something like this (subtracting the lower bound makes any ip_addr below it wrap around to a huge unsigned value, so a single comparison covers both bounds):
return ((unsigned)(ip_addr - p.start) <= (p.end - p.start));
With a modern compiler with optimization enabled, I'd expect the template to be expanded inline entirely at the point of use, so there probably wouldn't be any function calls involved at all.
I copied your code and ran timings of 10,000 failed (thus complete) searches of 10,000 element lists:
Without optimization:
TAILQ_FOREACH 0.717s
std::list<pfx_t *> 2.397s
std::list<pfx_t> 1.98s
(Note that I added a next field to pfx_t for TAILQ and used the same structure, with that field unused, for the std::list runs.)
You can see that lists of pointers is worse than lists of objects. Now with optimization:
TAILQ_FOREACH 0.467s
std::list<pfx_t *> 0.553s
std::list<pfx_t> 0.345s
So as everyone pointed out, optimization is the dominant term in a tight inner loop using collection types. Even the slowest variation is faster than the fastest unoptimized version. Perhaps more surprising is that the winner changes -- this is likely due to the compiler better recognizing optimization opportunities in the std code than in an OS-provided macro.

Fastest way to check equality with tolerance within a range?

The following function compares two arrays and returns true if all elements are equal, taking a tolerance into account.
#include <cmath>

// Equal
template<typename Type>
bool eq(const unsigned int n, const Type* x, const Type* y, const Type tolerance)
{
    bool ok = true;
    for (unsigned int i = 0; i < n; ++i) {
        if (std::abs(x[i] - y[i]) > std::abs(tolerance)) {
            ok = false;
            break;
        }
    }
    return ok;
}
Is there a way to beat the performance of this function?
Compute abs(tolerance) outside the loop.
You might try unrolling the loop into a 'major' loop and a 'minor' loop where the 'minor' loop's only jump is to its beginning and the 'major' loop has the 'if' and 'break' stuff. Do something like ok &= (x[i]-y[i] < abstol) & (y[i]-x[i] < abstol); in the minor loop to avoid branching -- note & instead of &&.
Then partially unroll and vectorise the minor loop. Then specialise for whatever floating-point types you're actually using and use your platform's SIMD instructions to do the minor loop.
Think before doing this, of course, since it can increase code size and thereby have ill effects on maintainability and sometimes the performance of other parts of your system.
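A sketch of that major/minor structure under the assumptions above (the block size of 16 is arbitrary and untuned, and the & trick is taken directly from the suggestion):

#include <cmath>

template <typename Type>
bool eq_unrolled(const unsigned int n, const Type* x, const Type* y, const Type tolerance)
{
    const Type abstol = std::abs(tolerance);  // hoisted out of the loop
    const unsigned int block = 16;            // minor-loop length, illustrative
    unsigned int i = 0;
    // Major loop: the only place that branches on the comparison result.
    for (; i + block <= n; i += block) {
        bool ok = true;
        // Minor loop: no data-dependent branch, so it is free to vectorize.
        for (unsigned int j = i; j < i + block; ++j)
            ok &= (x[j] - y[j] < abstol) & (y[j] - x[j] < abstol);
        if (!ok)
            return false;
    }
    // Remainder elements that did not fill a whole block.
    for (; i < n; ++i)
        if (std::abs(x[i] - y[i]) > abstol)
            return false;
    return true;
}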
You can avoid those return-variable assignments and precalculate the absolute value of the tolerance:
// Equal
template<typename Type>
bool eq(const unsigned int n, const Type* x, const Type* y, const Type tolerance) {
    const Type absTolerance = std::abs(tolerance);
    for (unsigned int i = 0; i < n; ++i) {
        if (std::abs(x[i] - y[i]) > absTolerance) {
            return false;
        }
    }
    return true;
}
Also, if you know the tolerance will always be positive, there's no need to calculate its absolute value; if not, you may make that a precondition.
I would do it like this; you can roll a C++03 version with class functors too, which will be more verbose but should be equally efficient (sketched below):
std::equal(x, x + n, y,
           [&tolerance](Type a, Type b) -> bool {
               return ((a - b) < tolerance) && ((a - b) > -tolerance);
           });
The major difference is dropping the abs: depending on Type and how abs is implemented, you might get an extra conditional execution path, with lots of branch mispredictions; this should certainly avoid that. The duplicate calculation of a-b will likely be optimized away by the compiler (if it deems that necessary).
Of course, it introduces an extra operator requirement on Type, and if operator< or operator> is slow, it might be slower than abs (measure it).
Also, std::equal is a standard algorithm doing all that looping and early breaking for you; it's always a good idea to use the standard library for this. It's usually nicer to maintain (in C++11 at least) and could get optimized better because you clearly show intent.
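The C++03 functor version mentioned above could look like this sketch (the name within_tolerance is hypothetical):

#include <algorithm>

// C++03-compatible function object replacing the lambda; same branch-free test.
template <typename Type>
struct within_tolerance {
    Type tolerance;
    explicit within_tolerance(Type t) : tolerance(t) {}
    bool operator()(Type a, Type b) const {
        return (a - b) < tolerance && (a - b) > -tolerance;
    }
};

// Usage: bool equal = std::equal(x, x + n, y, within_tolerance<double>(tolerance));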

How do I deal with "signed/unsigned mismatch" warnings (C4018)?

I work with a lot of calculation code written in C++ with high performance and low memory overhead in mind. It uses STL containers (mostly std::vector) a lot, and iterates over those containers in almost every single function.
The iterating code looks like this:
for (int i = 0; i < things.size(); ++i)
{
    // ...
}
But it produces the signed/unsigned mismatch warning (C4018 in Visual Studio).
Replacing int with some unsigned type is a problem because we frequently use OpenMP pragmas, and it requires the counter to be int.
I'm about to suppress the (hundreds of) warnings, but I'm afraid I've missed some elegant solution to the problem.
On iterators: I think iterators are great when applied in appropriate places. The code I'm working with will never change random-access containers into std::list or the like (so iterating with int i is already container-agnostic), and it will always need the current index. All the additional code you would have to type (the iterator itself and the index) just complicates matters and obfuscates the simplicity of the underlying code.
It's all in your things.size() type. It isn't int, but size_t (from <cstddef>), which equals some "usual" unsigned type, e.g. unsigned int on x86_32.
The operator "less" (<) cannot be applied directly to two operands of different signedness; there are no such opcodes. So the usual arithmetic conversions kick in, the signed operand is treated as unsigned, and the compiler emits that warning.
It would be correct to write it like
for (size_t i = 0; i < things.size(); ++i) { /**/ }
or even faster
for (size_t i = 0, ilen = things.size(); i < ilen; ++i) { /**/ }
Ideally, I would use a construct like this instead:
for (std::vector<your_type>::const_iterator i = things.begin(); i != things.end(); ++i)
{
// if you ever need the distance, you may call std::distance
// it won't cause any overhead because the compiler will likely optimize the call
size_t distance = std::distance(things.begin(), i);
}
This has the neat advantage that your code suddenly becomes container-agnostic.
And regarding your problem: if some library you use requires you to use int where an unsigned int would fit better, its API is messy. Anyway, if you are sure that those int values are always positive, you may just do:
int int_distance = static_cast<int>(distance);
Which will specify clearly your intent to the compiler: it won't bug you with warnings anymore.
If you can't/won't use iterators and if you can't/won't use std::size_t for the loop index, make a .size() to int conversion function that documents the assumption and does the conversion explicitly to silence the compiler warning.
#include <cassert>
#include <cstddef>
#include <limits>

// When using int loop indexes, use size_as_int(container) instead of
// container.size() in order to document the inherent assumption that the size
// of the container can be represented by an int.
template <typename ContainerType>
/* constexpr */ int size_as_int(const ContainerType &c) {
    const auto size = c.size();  // if no auto, use `typename ContainerType::size_type`
    assert(size <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    return static_cast<int>(size);
}
Then you write your loops like this:
for (int i = 0; i < size_as_int(things); ++i) { ... }
The instantiation of this function template will almost certainly be inlined. In debug builds, the assumption will be checked. In release builds, it won't be and the code will be as fast as if you called size() directly. Neither version will produce a compiler warning, and it's only a slight modification to the idiomatic loop.
If you want to catch assumption failures in the release version as well, you can replace the assertion with an if statement that throws something like std::out_of_range("container size exceeds range of int").
Note that this solves both the signed/unsigned comparison as well as the potential sizeof(int) != sizeof(Container::size_type) problem. You can leave all your warnings enabled and use them to catch real bugs in other parts of your code.
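For instance, a sketch of that release-checked variant (the name and exception choice follow the suggestion above):

#include <cstddef>
#include <limits>
#include <stdexcept>

// Like size_as_int, but checks in release builds too, throwing instead of asserting.
template <typename ContainerType>
int checked_size_as_int(const ContainerType &c) {
    const auto size = c.size();
    if (size > static_cast<std::size_t>(std::numeric_limits<int>::max()))
        throw std::out_of_range("container size exceeds range of int");
    return static_cast<int>(size);
}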
You can use:
the size_t type, to remove the warning messages
iterators + distance (like the first hint)
iterators only
a function object
For example:
#include <algorithm>
#include <iostream>
#include <vector>

// simple class that outputs its value
class ConsoleOutput
{
public:
    ConsoleOutput(int value) : m_value(value) { }
    int Value() const { return m_value; }
private:
    int m_value;
};

// function object
class Predicate
{
public:
    void operator()(ConsoleOutput const& item)
    {
        std::cout << item.Value() << std::endl;
    }
};

int main()
{
    // fill list
    std::vector<ConsoleOutput> list;
    list.push_back(ConsoleOutput(1));
    list.push_back(ConsoleOutput(8));

    // 1) using size_t
    for (size_t i = 0; i < list.size(); ++i)
    {
        std::cout << list.at(i).Value() << std::endl;
    }

    // 2) iterators + distance (here with non-const iterators)
    std::vector<ConsoleOutput>::iterator itDistance = list.begin(), endDistance = list.end();
    for (; itDistance != endDistance; ++itDistance)
    {
        // int or size_t
        int const position = static_cast<int>(std::distance(list.begin(), itDistance));
        std::cout << list.at(position).Value() << std::endl;
    }

    // 3) iterators
    std::vector<ConsoleOutput>::const_iterator it = list.begin(), end = list.end();
    for (; it != end; ++it)
    {
        std::cout << (*it).Value() << std::endl;
    }

    // 4) function objects
    std::for_each(list.begin(), list.end(), Predicate());
}
C++20 now has std::cmp_less
In C++20, we have the standard constexpr functions
std::cmp_equal
std::cmp_not_equal
std::cmp_less
std::cmp_greater
std::cmp_less_equal
std::cmp_greater_equal
added in the <utility> header, exactly for these kinds of scenarios.
Compare the values of two integers t and u. Unlike builtin comparison operators, negative signed integers always compare less than (and not equal to) unsigned integers: the comparison is safe against lossy integer conversion.
That means, if (for some weird reason) one must use i as an int in the loop and needs to compare it against the unsigned size, it can be done like this:
#include <utility> // std::cmp_less
for (int i = 0; std::cmp_less(i, things.size()); ++i)
{
// ...
}
This also covers the case where we mistakenly static_cast -1 (i.e., an int) to unsigned int. That means the following will not give you an error:
static_assert(1u < -1);
But the usage of std::cmp_less will
static_assert(std::cmp_less(1u, -1)); // error
I can also propose the following solution for C++11:
for (auto p = 0U; p < sys.size(); p++) {
}
(C++ is not smart enough to deduce an unsigned type from auto p = 0, so I have to write p = 0U...)
I will give you a better idea
for (decltype(things.size()) i = 0; i < things.size(); i++) {
    // ...
}
decltype is:
Inspects the declared type of an entity or the type and value category
of an expression.
So it deduces the type of things.size(), and i will have the same type as things.size(); therefore, i < things.size() executes without any warning.
I had a similar problem, and using size_t was not working for me. I tried the following, which worked (as below):
for (int i = things.size() - 1; i >= 0; i--)
{
    // ...
}
I would just do:
int pnSize = primeNumber.size();
for (int i = 0; i < pnSize; i++)
    cout << primeNumber[i] << ' ';