I want to inline the function MyClass::at(), but the performance isn't what I expect.
MyClass.cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <string>
// Making this a lot shorter than in my actual program
std::vector<std::vector<int>> arrarr =
{
{ 1, 70, 54, 71, 83, 51, 54, 69, 16, 92, 33, 48, 61, 43, 52, 1, 89, 19, 67, 48},
{24, 47, 32, 60, 99, 3, 45, 2, 44, 75, 33, 53, 78, 36, 84, 20, 35, 17, 12, 50},
{32, 98, 81, 28, 64, 23, 67, 10, 26, 38, 40, 67, 59, 54, 70, 66, 18, 38, 64, 70},
{67, 26, 20, 68, 2, 62, 12, 20, 95, 63, 94, 39, 63, 8, 40, 91, 66, 49, 94, 21},
{24, 55, 58, 5, 66, 73, 99, 26, 97, 17, 78, 78, 96, 83, 14, 88, 34, 89, 63, 72},
{21, 36, 23, 9, 75, 0, 76, 44, 20, 45, 35, 14, 0, 61, 33, 97, 34, 31, 33, 95},
{78, 17, 53, 28, 22, 75, 31, 67, 15, 94, 3, 80, 4, 62, 16, 14, 9, 53, 56, 92},
{16, 39, 5, 42, 96, 35, 31, 47, 55, 58, 88, 24, 0, 17, 54, 24, 36, 29, 85, 57},
{86, 56, 0, 48, 35, 71, 89, 7, 5, 44, 44, 37, 44, 60, 21, 58, 51, 54, 17, 58},
{19, 80, 81, 68, 5, 94, 47, 69, 28, 73, 92, 13, 86, 52, 17, 77, 4, 89, 55, 40},
{ 4, 52, 8, 83, 97, 35, 99, 16, 7, 97, 57, 32, 16, 26, 26, 79, 33, 27, 98, 66},
{88, 36, 68, 87, 57, 62, 20, 72, 3, 46, 33, 67, 46, 55, 12, 32, 63, 93, 53, 69},
{ 4, 42, 16, 73, 38, 25, 39, 11, 24, 94, 72, 18, 8, 46, 29, 32, 40, 62, 76, 36},
{20, 69, 36, 41, 72, 30, 23, 88, 34, 62, 99, 69, 82, 67, 59, 85, 74, 4, 36, 16},
{20, 73, 35, 29, 78, 31, 90, 1, 74, 31, 49, 71, 48, 86, 81, 16, 23, 57, 5, 54},
{ 1, 70, 54, 71, 83, 51, 54, 69, 16, 92, 33, 48, 61, 43, 52, 1, 89, 19, 67, 48},
};
class MyClass
{
public:
    MyClass(std::vector<std::vector<int>> arr) : arr(arr)
    {
        rows = arr.size();
        cols = arr.at(0).size();
    }
    inline auto at(int row, int col) const { return arr[row][col]; }
    void arithmetic(int n) const;
private:
    std::vector<std::vector<int>> arr;
    int rows;
    int cols;
};
MyClass.cpp:
void MyClass::arithmetic(int n) const
{
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;
    auto t1 = high_resolution_clock::now();
    int highest_product = 0;
    for (auto y = 0; y < rows; ++y)
    {
        for (auto x = 0; x < cols; ++x)
        {
            // Horizontal product
            if (x + n < cols)
            {
                auto product = 1;
                for (auto i = 0; i < n; ++i)
                {
                    product *= at(y, x + i);
                }
                highest_product = std::max(highest_product, product);
            }
        }
    }
    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> ms_double = t2 - t1;
    std::cout << ms_double.count() << "ms\n";
    // arithmetic() is declared void, so print the result instead of returning it
    std::cout << highest_product << "\n";
}
Now what I want to know is: why do I get better performance when I replace product *= at(y, x + i); with product *= arr[y][x+i];? With the first version, the timing on my large array is roughly 6.7ms, and with the second it is 5.3ms. I thought that once the function was inlined, it would be the same implementation as the second case.
Member functions defined directly in the class definition (typically in header files) are implicitly inline, so the inline keyword is redundant in this case. inline does not guarantee that the function is inlined; it is just a hint to the compiler. The keyword also matters at link time, where it avoids multiple-definition errors. Functions that are not marked inline can still be inlined if the compiler can see the body of the target function (i.e. it is in the same translation unit, or link-time optimization is applied). For more information about this, please read Why are class member functions inlined?
Note that inlining is typically performed in the optimization step of compilers (e.g. -O1 or /O1). Thus, without optimizations, most compilers will not inline the function.
Using std::vector<std::vector<int>> is not efficient since it is not a contiguous data structure and it requires two indirections to access an item. Two sub-vectors next to each other can be stored far apart in memory, likely causing more cache misses (and/or thrashing due to alignment). Please consider using one big flattened array and accessing items with y*cols+x, where cols is the size of the sub-vectors (20 here). Alternatively, an int[16][20] data type should do the job well if the size is fixed at compile time.
MyClass(std::vector<std::vector<int>> arr) causes the input parameter to be copied (and so all the sub-vectors). Please consider using a const std::vector<std::vector<int>>& type.
While std::vector::at is convenient for checking bounds at runtime, this feature can noticeably decrease performance. Consider using operator [] if you do not need the check. You can combine assertions with a flattened array to get fast code in release builds and safe code in debug builds (assertions are enabled or disabled by defining the NDEBUG macro).
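The suggestions above (flattened storage, a const-reference constructor parameter, and assertions instead of checked access) could be combined roughly as follows. This is only a sketch: FlatGrid, row_count and col_count are hypothetical names, not part of the question's code, and it assumes every row has the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical replacement for MyClass: one contiguous, row-major buffer.
class FlatGrid
{
public:
    // Taking the nested vectors by const reference avoids copying the argument itself.
    explicit FlatGrid(const std::vector<std::vector<int>>& rows_in)
        : rows(rows_in.size()),
          cols(rows_in.empty() ? 0 : rows_in.front().size())
    {
        data.reserve(rows * cols);
        for (const auto& row : rows_in)
        {
            assert(row.size() == cols); // assumption: all rows have equal length
            data.insert(data.end(), row.begin(), row.end());
        }
    }

    // The bounds check is an assert, so it disappears when NDEBUG is defined.
    int at(std::size_t row, std::size_t col) const
    {
        assert(row < rows && col < cols);
        return data[row * cols + col];
    }

    std::size_t row_count() const { return rows; }
    std::size_t col_count() const { return cols; }

private:
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;
};
```

Whether this closes the gap measured in the question depends on the compiler and flags, so it is worth re-timing with optimizations enabled.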
I use the boost::hana to_map function to remove duplicates from a boost::hana tuple of types. See it on Compiler Explorer. The code works very well but takes a long time to compile (~10s). I wonder if there is a faster solution that is compatible with boost::hana tuples.
#include <boost/hana/map.hpp>
#include <boost/hana/pair.hpp>
#include <boost/hana/type.hpp>
#include <boost/hana/basic_tuple.hpp>
#include <boost/hana/size.hpp>
using namespace boost::hana;
constexpr auto to_type_pair = [](auto x) { return make_pair(typeid_(x), x); };
template <class Tuple>
constexpr auto remove_duplicate_types(Tuple tuple)
{
return values(to_map(transform(tuple, to_type_pair)));
}
int main(){
auto tuple = make_basic_tuple(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70
);
auto noDuplicatesTuple = remove_duplicate_types(tuple);
// Should return 1 since there is only one distinct type in the tuple
return size(noDuplicatesTuple);
}
I haven't run any benchmarks, but your example does not appear to take 10 seconds on Compiler Explorer. However, I can explain why it is a relatively slow solution, and suggest an alternative that assumes you are only interested in getting a unique list of types and not in retaining any run-time information in your result.
Creating large tuples and/or instantiating function templates that have large tuples in their prototypes are expensive compile-time operations.
Just your call to transform instantiates a lambda for each element which in turn instantiates pair. The input/output of this call are both large tuples.
The call to to_map makes an empty map and recursively calls insert for each element each time making a new map, but in this simple case the intermediate result will always be hana::map<int>. I'm willing to bet that this is exploding your compile-times if your actual use case is non-trivial. (It was certainly an issue when we were implementing hana::map so we made hana::make_map avoid this since it has all of its inputs up front).
On top of all this, there is a significant penalty for these large function types being used in run-time code. You might notice a difference if you wrapped the operations in decltype and only used the resulting type.
Alternatively, raw template metaprogramming can sometimes yield better compile-time performance than function-template-based metaprogramming. Here is an example for your use case:
#include <boost/hana/basic_tuple.hpp>
#include <boost/mp11/algorithm.hpp>
#include <type_traits> // std::decay_t
namespace hana = boost::hana;
using namespace boost::mp11;
int main() {
auto tuple = hana::make_basic_tuple(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70
);
hana::basic_tuple<int> no_dups = mp_unique<std::decay_t<decltype(tuple)>>{};
}
https://godbolt.org/z/EnTWf6
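To illustrate what "raw template metaprogramming" means here without depending on any library, the following is a minimal hand-rolled sketch of removing duplicate types from a pack. The names type_list, push_unique, unique_impl and unique_t are hypothetical; this is not how mp_unique is implemented, and it has not been tuned for compile time.

```cpp
#include <type_traits>

// Hypothetical bare type list (stands in for hana::basic_tuple or mp_list).
template <class... Ts> struct type_list {};

// Append T to the list only if it is not already present.
template <class List, class T> struct push_unique;
template <class... Ts, class T>
struct push_unique<type_list<Ts...>, T>
{
    using type = std::conditional_t<(std::is_same_v<Ts, T> || ...),
                                    type_list<Ts...>,
                                    type_list<Ts..., T>>;
};

// Walk the pack left to right, accumulating the unique types in Acc.
template <class Acc, class... Rest>
struct unique_impl { using type = Acc; };
template <class Acc, class T, class... Rest>
struct unique_impl<Acc, T, Rest...>
    : unique_impl<typename push_unique<Acc, T>::type, Rest...> {};

template <class... Ts>
using unique_t = typename unique_impl<type_list<>, Ts...>::type;

// Only types are manipulated, so no tuple object or lambda is ever instantiated.
static_assert(std::is_same_v<unique_t<int, int, int>, type_list<int>>);
static_assert(std::is_same_v<unique_t<int, char, int, char>, type_list<int, char>>);
```

It requires C++17 for the fold expression and std::is_same_v.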
I have two __m256i vectors, filled with 32 8-bit integers. Something like this:
__int8 *a0 = new __int8[32] {2};
__int8 *a1 = new __int8[32] {3};
__m256i v0 = _mm256_loadu_si256((__m256i*)a0);
__m256i v1 = _mm256_loadu_si256((__m256i*)a1);
How can I multiply these vectors, using something like _mm256_mul_epi8(v0, v1) (which does not exist) or any other way?
I want 2 vectors of results, because the output element width is twice the input element width. Or something that works similarly to _mm_mul_epu32 would be ok, using only the even input elements (0, 2, 4, etc.)
You want the result separated into two vectors, so here is my suggestion. I've tried to keep it clear, simple and practical:
#include <stdio.h>
#include <x86intrin.h>
void _mm256_print_epi8(__m256i );
void _mm256_print_epi16(__m256i );
void _mm256_mul_epi8(__m256i , __m256i , __m256i* , __m256i* );
int main()
{
char a0[32] = {1, 2, 3, -4, 5, 6, 7, 8, 9, -10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, -24, 25, 26, 27, 28, 29, 30, 31, 32};
char a1[32] = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -13, 14, 15, 16, 17, 18, 19, -20, 21, 22, 23, 24, -25, 26, 27, 28, 29, 30, 31, 32, 33};
__m256i v0 = _mm256_loadu_si256((__m256i*) &a0[0]);
__m256i v1 = _mm256_loadu_si256((__m256i*) &a1[0]);
__m256i r0, r1;//for 16 bit results
_mm256_mul_epi8(v0, v1, &r0, &r1);
printf("\nv0 = ");_mm256_print_epi8(v0);
printf("\nv1 = ");_mm256_print_epi8(v1);
printf("\nr0 = ");_mm256_print_epi16(r0);
printf("\nr1 = ");_mm256_print_epi16(r1);
printf("\nfinished\n");
return 0;
}
//v0 and v1 are 8 bit input vectors. r0 and r1 are 16 bit results of multiplications
void _mm256_mul_epi8(__m256i v0, __m256i v1, __m256i* r0, __m256i* r1)
{
    __m256i tmp0, tmp1;
    __m128i m128_v0, m128_v1;
    // low 128 bits: elements 0..15
    m128_v0 = _mm256_extractf128_si256 (v0, 0);
    m128_v1 = _mm256_extractf128_si256 (v1, 0);
    tmp0= _mm256_cvtepi8_epi16 (m128_v0); //printf("\ntmp0 = ");_mm256_print_epi16(tmp0);
    tmp1= _mm256_cvtepi8_epi16 (m128_v1); //printf("\ntmp1 = ");_mm256_print_epi16(tmp1);
    *r0 =_mm256_mullo_epi16(tmp0, tmp1);
    // high 128 bits: elements 16..31
    m128_v0 = _mm256_extractf128_si256 (v0, 1);
    m128_v1 = _mm256_extractf128_si256 (v1, 1);
    tmp0= _mm256_cvtepi8_epi16 (m128_v0); //printf("\ntmp0 = ");_mm256_print_epi16(tmp0);
    tmp1= _mm256_cvtepi8_epi16 (m128_v1); //printf("\ntmp1 = ");_mm256_print_epi16(tmp1);
    *r1 =_mm256_mullo_epi16(tmp0, tmp1);
}
void _mm256_print_epi8(__m256i vec)
{
char temp[32];
_mm256_storeu_si256((__m256i*)&temp[0], vec);
int i;
for(i=0; i<32; i++)
printf(" %3i,", temp[i]);
}
void _mm256_print_epi16(__m256i vec)
{
short temp[16];
_mm256_storeu_si256((__m256i*)&temp[0], vec);
int i;
for(i=0; i<16; i++)
printf(" %3i,", temp[i]);
}
The output is:
[martin#mrt Stack over flow]$ gcc -O2 -march=native mul_epi8.c -o out
[martin#mrt Stack over flow]$ ./out
v0 = 1, 2, 3, -4, 5, 6, 7, 8, 9, -10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, -24, 25, 26, 27, 28, 29, 30, 31, 32,
v1 = 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -13, 14, 15, 16, 17, 18, 19, -20, 21, 22, 23, 24, -25, 26, 27, 28, 29, 30, 31, 32, 33,
r0 = 2, 6, 12, -20, 30, 42, 56, 72, 90, -110, 132, -156, 182, 210, 240, 272,
r1 = 306, 342, -380, 420, 462, 506, 552, 600, 650, 702, 756, 812, 870, 930, 992, 1056,
finished
[martin#mrt Stack over flow]$
NOTE: I've commented out the printing of the intermediate results tmp0 and tmp1 in the recommended code.
In addition, as Peter suggested in the comments (with a godbolt link), if your inputs are loaded from memory rather than already held in 256-bit vectors, you can use this code:
#include <immintrin.h>
//v0 and v1 are 8 bit input vectors. The return value holds the 16 bit results of the multiplications
__m256i mul_epi8_to_16(__m128i v0, __m128i v1)
{
    __m256i tmp0 = _mm256_cvtepi8_epi16 (v0); //printf("\ntmp0 = ");_mm256_print_epi16(tmp0);
    __m256i tmp1 = _mm256_cvtepi8_epi16 (v1); //printf("\ntmp1 = ");_mm256_print_epi16(tmp1);
    return _mm256_mullo_epi16(tmp0, tmp1);
}
__m256i mul_epi8_to_16_memsrc(char *__restrict a, char *__restrict b){
    __m128i v0 = _mm_loadu_si128((__m128i*) a);
    __m128i v1 = _mm_loadu_si128((__m128i*) b);
    return mul_epi8_to_16(v0, v1);
}
int main()
{
char a0[32] = {1, 2, 3, -4, 5, 6, 7, 8, 9, -10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, -24, 25, 26, 27, 28, 29, 30, 31, 32};
char a1[32] = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -13, 14, 15, 16, 17, 18, 19, -20, 21, 22, 23, 24, -25, 26, 27, 28, 29, 30, 31, 32, 33};
__m256i r0 = mul_epi8_to_16_memsrc(a0, a1);
}
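As a cross-check for either version above, a plain scalar reference can be handy when testing. The sketch below is not from the answer; mul_epi8_reference is a hypothetical helper that computes the same widening multiply one element at a time. Elements 0..15 of its output correspond to r0 and elements 16..31 to r1.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference: widen each signed 8-bit element to 16 bits, then multiply.
// The product of two int8_t values always fits in int16_t (at most 128 * 128 = 16384).
std::array<std::int16_t, 32> mul_epi8_reference(const std::array<std::int8_t, 32>& a,
                                                const std::array<std::int8_t, 32>& b)
{
    std::array<std::int16_t, 32> out{};
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i] = static_cast<std::int16_t>(static_cast<std::int16_t>(a[i]) * b[i]);
    }
    return out;
}
```

Comparing its output with the stored contents of r0 and r1 for a few random inputs is a cheap way to catch sign-extension mistakes.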
I wrote a program in MSVS 2015, but I need to run it in MSVS 2013.
I get the error
"Error 1 error C2661: 'std::vector>::vector' :
no overloaded function takes 21
arguments \vmwfil04\students$\1302273\visual studio
2013\projects\dartsc++2013\dartsc++2013\gui.h 22 1"
This problem is affecting all of my vectors I've created before runtime.
What could be causing this?
offending code:
vector<int> Double{ 0, 40, 2, 36, 8, 26, 12, 20, 30, 4, 34, 6, 38, 14, 32, 16, 22, 28, 18, 24, 10 };
vector<int> Normal{ 0, 20, 1, 18, 4, 13, 6, 10, 15, 2, 17, 3, 19, 7, 16, 8, 11, 14, 9, 12, 5 };
vector<int> Treble{ 0, 60, 3, 54, 12, 39, 18, 30, 45, 6, 39, 9, 57, 21, 48, 24, 33, 42, 27, 36, 15 };
vector<int> Bull { 0, 25, 50};
Support for these list-initialisers was new in VS 2015. It's not present in VS 2013. So you can't do it.
You'll have to take the old-fashioned, C++03 approach instead.
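A minimal sketch of that C++03-style approach, using the Normal values from the question: fill a plain array and hand its begin/end pointers to the vector's iterator-range constructor. The helper names normal_values and normal_count are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Values copied from the question's Normal vector.
static const int normal_values[] = { 0, 20, 1, 18, 4, 13, 6, 10, 15, 2, 17, 3, 19, 7, 16, 8, 11, 14, 9, 12, 5 };
static const std::size_t normal_count = sizeof(normal_values) / sizeof(normal_values[0]);

// The iterator-range constructor predates C++11, so it does not rely on
// initializer-list support in the compiler.
std::vector<int> Normal(normal_values, normal_values + normal_count);
```

The other vectors can be built the same way, each from its own array.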
I believe this is a bug in Visual Studio 2013, as it does support list initializers (2013-specific documentation of the functionality). Try enclosing the braces in parentheses, per this answer.
e.g. vector<int> Double({ 0, 40, 2, 36, 8, 26, 12, 20, 30, 4, 34, 6, 38, 14, 32, 16, 22, 28, 18, 24, 10 });
I'm trying to build someone else's c++ project-- I'm using the code from here. It's a release of the project, so I believe it's supposed to work as is. However, on compiling I get this error. I've tried moving __gammaTable into the gamma() function, but am not really sure why this isn't considered in its scope.
src/gamma.cpp:15:9: error: 'prog_uchar' does not name a type
PROGMEM prog_uchar __gammaTable[] = {
^
In file included from /usr/share/arduino/hardware/arduino/cores/arduino/Arduino.h:8:0,
from src/gamma.h:4,
from src/gamma.cpp:13:
src/gamma.cpp: In function 'byte gamma(byte)':
src/gamma.cpp:42:27: error: '__gammaTable' was not declared in this scope
return pgm_read_byte(&__gammaTable[x*2 + 2]);
^
.build/lilypad328/Makefile:339: recipe for target '.build/lilypad328/src/gamma.o' failed
make: *** [.build/lilypad328/src/gamma.o] Error 1
Make failed with code 2
And here's code tripping the error:
#include "gamma.h"
PROGMEM prog_uchar __gammaTable[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11,
11, 11, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 15, 15, 16, 16,
16, 17, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 21, 21, 22, 22,
23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 29, 29, 30,
30, 31, 32, 32, 33, 33, 34, 34, 35, 35, 36, 37, 37, 38, 38, 39,
40, 40, 41, 41, 42, 43, 43, 44, 45, 45, 46, 47, 47, 48, 49, 50,
50, 51, 52, 52, 53, 54, 55, 55, 56, 57, 58, 58, 59, 60, 61, 62,
62, 63, 64, 65, 66, 67, 67, 68, 69, 70, 71, 72, 73, 74, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99,100,101,102,104,105,106,107,108,
109,110,111,113,114,115,116,117,118,120,121,122,123,125,126,127
};
byte gamma(byte x) {
return pgm_read_byte(&__gammaTable[x*2 + 2]);
}
A quick fix is to change "prog_uchar" to "unsigned char". It looks like the table contains integers in the range (0,255), and "prog_uchar" is supposed to mean an unsigned char.
But the proper way to fix it is to find where "prog_uchar" is defined, and #include that file.
Your actual error is this:
src/gamma.cpp:15:9: error: 'prog_uchar' does not name a type
You are probably missing the header file where prog_uchar is defined:
#include <avr/pgmspace.h>
This error:
src/gamma.cpp:42:27: error: '__gammaTable' was not declared in this scope
return pgm_read_byte(&__gammaTable[x*2 + 2]);
simply comes from the fact that, since prog_uchar is missing, the variable __gammaTable is not properly declared.
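If including the header does not help, one possible cause (an assumption, not confirmed by the question) is a newer toolchain: the prog_uchar family of typedefs was deprecated in avr-libc 1.8, and recent avr-gcc versions require PROGMEM data to be const. In that case the declaration could be written along these lines. The table contents, gammaTableSketch and gamma_lookup are hypothetical stand-ins, and the #else branch exists only so the sketch also compiles on a non-AVR host.

```cpp
#include <stdint.h>

#if defined(__AVR__)
#include <avr/pgmspace.h> // provides PROGMEM and pgm_read_byte on AVR targets
#else
// Host-side stand-ins (illustration only, not part of avr-libc).
#define PROGMEM
#define pgm_read_byte(addr) (*(const uint8_t*)(addr))
#endif

// Shortened, made-up table standing in for the full gamma table.
const uint8_t gammaTableSketch[] PROGMEM = { 0, 0, 1, 2, 4, 7, 11, 16 };

// Read one entry back out of program memory.
uint8_t gamma_lookup(uint8_t i)
{
    return pgm_read_byte(&gammaTableSketch[i]);
}
```

The key changes compared with the original declaration are the plain const uint8_t element type and PROGMEM placed after the variable name.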