I am working in Python/PyTorch and I have an example like this:
The 2D tensor a (dim-0 runs down the rows, dim-1 runs across the columns):

[[-1.7739,  0.8073,  0.0472, -0.4084],
 [ 0.6378,  0.6575, -1.2970, -0.0625],
 [ 1.7970, -1.3463,  0.9011, -0.8704],
 [ 1.5639,  0.7123,  0.0385,  1.8410]]
Then, the argmax along dimension 1 will be
# argmax (indices where max values are present) along dimension-1
In [215]: torch.argmax(a, dim=1)
Out[215]: tensor([1, 1, 0, 3])
My question is: given the 2D tensor a as above, how could I implement an argmax function in C++ that gives me the same output? Thanks for reading.
This is what I did
vector<vector<float>> a_vect
{
{-1.7739, 0.8073, 0.0472, -0.4084},
{0.6378, 0.6575, -1.2970, -0.0625},
{1.7970, -1.3463, 0.9011, -0.8704},
{1.5639, 0.7123, 0.0385, 1.8410}
};
std::vector<int>::iterator max = max_element(a_vect.begin() , a_vect.end()-a_vect.begin());
You can use std::max_element to find the index in each sub vector
#include <algorithm>
#include <iostream>
#include <vector>
using std::vector;
int main()
{
vector<vector<float>> a_vect=
{
{-1.7739, 0.8073, 0.0472, -0.4084},
{0.6378, 0.6575, -1.2970, -0.0625},
{1.7970, -1.3463, 0.9011, -0.8704},
{1.5639, 0.7123, 0.0385, 1.8410}
};
vector<int> max_index;
for(auto& v:a_vect)
max_index.push_back(std::max_element(v.begin(),v.end())-v.begin());
for(auto i:max_index)
std::cout << i << ' '; // 1 1 0 3
}
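If you ever need the equivalent of torch.argmax(a, dim=0) as well (the row index of the maximum in each column), the same idea works with a plain loop. A minimal sketch, assuming all rows have the same length (the function name is mine):
#include <cstddef>
#include <vector>

// for each column, the row index holding that column's maximum value
std::vector<int> argmax_dim0(const std::vector<std::vector<float>>& a)
{
    std::vector<int> max_index(a.empty() ? 0 : a[0].size(), 0);
    for (std::size_t col = 0; col < max_index.size(); ++col)
        for (std::size_t row = 1; row < a.size(); ++row)
            if (a[row][col] > a[max_index[col]][col])
                max_index[col] = static_cast<int>(row);
    return max_index;
}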
TL;DR
What's wrong with the last commented block of lines below?
// headers and definitions are further down in the question
int main() {
std::vector<int> v{10,20,30};
using type_of_temp = std::vector<std::pair<std::vector<int>,int>>;
// seems to work, I think it does work
auto temp = copy_range<type_of_temp>(v | indexed(0)
| transformed(complex_keep_index));
auto w = temp | transformed(distribute);
print(w);
// shows undefined behavior
//auto z = v | indexed(0)
// | transformed(complex_keep_index)
// | transformed(distribute);
//print(z);
}
Or, in other words, what makes piping v | indexed(0) into transformed(complex_keep_index) well defined, but piping v | indexed(0) | transformed(complex_keep_index) into transformed(distribute) undefined behavior?
Extended version
I have a container of things,
std::vector<int> v{10,20,30};
and I have a function which generates another container from each of those things,
// this is in general a computation of type
// T -> std::vector<U>
constexpr auto complex_comput = [](auto const& x){
return std::vector{x,x+1,x+2}; // in general the number of elements changes
};
so if I were to apply complex_comput to v, I'd get,
{{10, 11, 12}, {20, 21, 22}, {30, 31, 32}}
and if I was to also concatenate the results, I'd finally get this:
{10, 11, 12, 20, 21, 22, 30, 31, 32}
However, I want to keep track of the index where each number came from, in a way that the result would encode something like this:
0 10
0 11
0 12
1 20
1 21
1 22
2 30
2 31
2 32
To accomplish this, I (eventually) came up with this solution, where I attempted to make use of ranges from Boost. Specifically I do the following:
1. use boost::adaptors::indexed to attach the index to each element of v,
2. transform each resulting "pair" into a std::pair storing the index and the result of applying complex_comput to the value,
3. and finally transform each std::pair<std::vector<int>,int> into a std::vector<std::pair<int,int>>.
However, I had to give up on keeping a lazy range between steps 2 and 3, using a helper "true" std::vector in between the two transformations.
#include <boost/range/adaptor/indexed.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/iterator_range_core.hpp>
#include <iostream>
#include <utility>
#include <vector>
using boost::adaptors::indexed;
using boost::adaptors::transformed;
using boost::copy_range;
constexpr auto complex_comput = [](auto const& x){
// this is in general a computation of type
// T -> std::vector<U>
return std::vector{x,x+1,x+2};
};
constexpr auto complex_keep_index = [](auto const& x){
return std::make_pair(complex_comput(x.value()), x.index());
};
constexpr auto distribute = [](auto const& pair){
return pair.first | transformed([n = pair.second](auto x){
return std::make_pair(x, n);
});
};
template<typename T>
void print(T const& w) {
for (auto const& elem : w) {
for (auto i : elem) {
std::cout << i.second << ':' << i.first << ' ';
}
std::cout << std::endl;
}
}
int main() {
std::vector<int> v{10,20,30};
using type_of_temp = std::vector<std::pair<std::vector<int>,int>>;
auto temp = copy_range<type_of_temp>(v | indexed(0)
| transformed(complex_keep_index));
auto w = temp | transformed(distribute);
print(w);
//auto z = v | indexed(0)
// | transformed(complex_keep_index)
// | transformed(distribute);
//print(z);
}
Indeed, uncommenting the lines defining and using z gives you code that compiles but generates rubbish results, i.e. undefined behavior. Note that applying copy_range<type_of_temp> to the first, working, range is necessary; otherwise the resulting code is essentially the same as the one on the right of auto z =.
Why do I have to do so? What are the details that makes the oneliner not work?
I partly understand the reason, and I'll list my understanding/thoughts in the following, but I'm asking this question to get a thorough explanation of all the details of this.
I understand that the undefined behavior I observe stems from z being a range defined as a view on some temporary which has been destroyed;
given the working version of the code, it is apparent that the temporary is v | indexed(0) | transformed(complex_keep_index);
however, isn't v | indexed(0) itself a temporary that is fed to transformed(complex_keep_index)?
Probably one important detail is that the expression v | indexed(0) is no more than a lazy range, which evaluates nothing, but just sets things up so that when one iterates on the range the computation is done; after all I can easily do v | indexed(0) | indexed(0) | indexed(0), which is well defined;
and also the whole v | indexed(0) | transformed(complex_keep_index) is well defined, otherwise the code above using w would probably misbehave (I know that UB doesn't mean that the result has to show something is wrong, and things could just look OK on this hardware, in this moment, and break tomorrow).
So there's something inherently wrong in passing an rvalue to transformed(distribute);
but what's wrong in doing so lies in distribute, not in transformed, because for instance changing distribute to [](auto x){ return x; } seems to be well defined.
So what's wrong with distribute? Here's the code
constexpr auto distribute = [](auto const& pair){
return pair.first | transformed([n = pair.second](auto x){
return std::make_pair(x, n);
});
};
What's the problem with it? The returned range (the output of this transformed) will hold some iterators/pointers/references to pair.first, which is part of what goes out of scope when distribute returns; but pair is a reference to something in the caller, which keeps living, right?
However I know that even though a const reference (e.g. pair) can keep a temporary (e.g. the elements of v | indexed(0) | transformed(complex_keep_index)) alive, that doesn't mean that the temporary stays alive when that reference goes out of scope just because it is in turn referenced by something else (references/pointers/iterators in the output of transformed([n = …](…){ … })) which doesn't go out of scope.
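To state that lifetime rule in isolation, here is a minimal sketch with no Boost involved (the names are purely illustrative) of what I believe is the same issue:
#include <iostream>
#include <utility>
#include <vector>

// plays the role of `distribute`: it returns something that refers into its argument
const std::vector<int>& first_of(const std::pair<std::vector<int>, int>& p)
{
    return p.first;
}

int main()
{
    // the temporary pair lives only until the end of this full expression;
    // binding it to the const reference *parameter* does not extend its lifetime,
    // so `view` dangles afterwards -- analogous to the range stored inside `z`
    const std::vector<int>& view = first_of(std::make_pair(std::vector<int>{1, 2, 3}, 0));
    std::cout << view.size() << '\n'; // undefined behavior: reads a destroyed vector
}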
I think/hope that probably the answer is already in what I've written above, however I need some help to streamline all of that so that I can understand it once and for all.
#include <future>
#include <iostream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <random>
#include <chrono>
#include <utility>
#include <type_traits>
template <class Clock = std::chrono::high_resolution_clock, class Task>
double timing(Task&& t, typename std::result_of<Task()>::type* r = nullptr)
{
using namespace std::chrono;
auto begin = Clock::now();
if (r != nullptr) *r = std::forward<Task>(t)();
auto end = Clock::now();
return duration_cast<duration<double>>(end - begin).count();
}
template <typename Num>
double sum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
double s = 0;
for (auto i = l; i <= h; i++) s += v[i];
return s;
}
template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
auto m = (l + h) / 2;
auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
return s1.get() + s2.get();
}
int main()
{
std::vector<uint> v(1000);
auto s = std::chrono::system_clock::now().time_since_epoch().count();
std::generate(v.begin(), v.end(), std::minstd_rand0(s));
double r;
std::cout << 1000 * timing([&]() -> double { return asum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
std::cout << 1000 * timing([&]() -> double { return sum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
}
Hi,
So above are two functions for summing up a vector of random numbers.
I did several runs, but it seems that I did not benefit from std::async. Below are some results I got.
0.130582 msec | rst 1.09015e+12
0.001402 msec | rst 1.09015e+12
0.23185 msec | rst 1.07046e+12
0.002308 msec | rst 1.07046e+12
0.18052 msec | rst 1.07449e+12
0.00244 msec | rst 1.07449e+12
0.190455 msec | rst 1.08319e+12
0.002315 msec | rst 1.08319e+12
In all four cases the async version took more time. But ideally it should have been about two times faster, right?
Did I miss anything in my code?
By the way I am running on OS X 10.10.4 Macbook Air with 1.4 GHz Intel Core i5.
Thanks,
Edits:
compiler flags: g++ -o asum asum.cpp -std=c++11
I changed the flags to include -O3 and the vector size to 10000000, but the results are still weird.
72.1743 msec | rst 1.07349e+16
14.3739 msec | rst 1.07349e+16
58.3542 msec | rst 1.07372e+16
12.1143 msec | rst 1.07372e+16
57.1576 msec | rst 1.07371e+16
11.9332 msec | rst 1.07371e+16
59.9104 msec | rst 1.07395e+16
11.9923 msec | rst 1.07395e+16
64.032 msec | rst 1.07371e+16
12.0929 msec | rst 1.07371e+16
here
auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
std::async will store its own copy of the vector, twice. You should use std::cref, and make sure the futures are retrieved before the vector dies (as is the case in your current code) and that accesses get properly synchronized (as is the case in your current code).
As mentioned in comments, thread creation overhead may further slow down the code.
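Applied to the asum in the question, the fix is one std::cref per call; a sketch, assuming sum keeps its current signature:
template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
    auto m = (l + h) / 2;
    // std::cref passes a std::reference_wrapper, so each task sees the caller's
    // vector instead of receiving its own copy
    auto s1 = std::async(std::launch::async, sum<Num>, std::cref(v), l, m);
    auto s2 = std::async(std::launch::async, sum<Num>, std::cref(v), m + 1, h);
    return s1.get() + s2.get();
}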
Well, this is the simplest possible example, and the results should not be taken as definitive, for the following reasons:
when you create a thread, it takes some extra CPU cycles to create the thread context and stack, and those cycles are added on top of the sum function (a rough way to see this overhead in isolation is sketched after this list);
when the main thread runs this code, it is otherwise idle and doing nothing else besides computing the sum;
we go for a multithreaded solution only when we can't accomplish something in a single thread, or when we would otherwise have to wait synchronously for some input;
just by creating threads, you don't increase performance. You increase performance by carefully designing multithreaded applications where otherwise idle CPUs can do useful work while something else is waiting, for example on IO.
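A rough, illustrative way to see the thread-creation overhead from the first point in isolation (timing a no-op task; the numbers will vary by machine):
#include <chrono>
#include <future>
#include <iostream>

int main()
{
    auto begin = std::chrono::high_resolution_clock::now();
    auto f = std::async(std::launch::async, []{}); // launch a task that does nothing
    f.get();                                       // wait for it to finish
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double, std::micro>(end - begin).count()
              << " us just to spawn and join one async task\n";
}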
First, the performance of your original async function is bad compared with the sequential one because it makes one more copy of your test data, as mentioned in other answers. Second, you might not be able to see the improvement after fixing the copying issue because creating threads is not cheap and it can kill your performance gain.
From the benchmark results, I can see async version is 1.88 times faster than that of the sequential version for N = 1000000. However, if I use N = 10000 then async version is 3.55 times slower. Both non-iterator and iterator solutions produce similar results.
Besides that, you should use iterators when writing your code because this approach is more flexible: for example, you can try different container types, it will give you similar performance compared to the C-style version, and it is also more elegant IMHO :).
Benchmark results:
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | orginal | Null | 50 | 10 | 1.00000 | 1366.30000 | 731.90 |
sum | orginal_async | Null | 50 | 10 | 0.53246 | 727.50000 | 1374.57 |
sum | iter | Null | 50 | 10 | 1.00022 | 1366.60000 | 731.74 |
sum | iter_async | Null | 50 | 10 | 0.53261 | 727.70000 | 1374.19 |
Complete.
Celero
Timer resolution: 0.001000 us
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | orginal | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | orginal_async | Null | 50 | 10 | 3.55882 | 48.40000 | 20661.16 |
sum | iter | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | iter_async | Null | 50 | 10 | 3.53676 | 48.10000 | 20790.02 |
Complete.
Complete code sample
#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <type_traits>
#include <vector>
#include <iostream>
#include "celero/Celero.h"
constexpr int NumberOfSamples = 50;
constexpr int NumberOfIterations = 10;
template <typename Container>
double sum(const Container &v, const std::size_t begin, const std::size_t end) {
double s = 0;
for (auto idx = begin; idx < end; ++idx) {
s += v[idx];
}
return s;
}
template <typename Container>
double sum_async(const Container &v, const std::size_t begin, const std::size_t end) {
auto middle = (begin + end) / 2;
// Removing std::cref would slow down this function because it makes two copies of v.
auto s1 = std::async(std::launch::async, sum<Container>, std::cref(v), begin, middle);
auto s2 = std::async(std::launch::async, sum<Container>, std::cref(v), middle, end);
return s1.get() + s2.get();
}
template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter(Iter begin, Iter end) {
typename std::iterator_traits<Iter>::value_type results = 0.0;
std::for_each(begin, end, [&results](auto const item) { results += item; });
return results;
}
template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter_async(Iter begin, Iter end) {
Iter middle = begin + std::distance(begin, end) / 2;
auto s1 = std::async(std::launch::async, sum_iter<Iter>, begin, middle);
auto s2 = std::async(std::launch::async, sum_iter<Iter>, middle, end);
return s1.get() + s2.get();
}
template <typename T> auto create_test_data(const size_t N) {
auto s = std::chrono::system_clock::now().time_since_epoch().count();
std::vector<T> v(N);
std::generate(v.begin(), v.end(), std::minstd_rand0(s));
return v;
}
// Create test data
constexpr size_t N = 10000;
using value_type = double;
auto data = create_test_data<value_type>(N);
using container_type = decltype(data);
CELERO_MAIN
BASELINE(sum, orginal, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum<container_type>(data, 0, N));
}
BENCHMARK(sum, orginal_async, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_async<container_type>(data, 0, N));
}
BENCHMARK(sum, iter, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_iter(data.cbegin(), data.cend()));
}
BENCHMARK(sum, iter_async, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_iter_async(data.cbegin(), data.cend()));
}
In C++11, is it possible to write the following
int ns[] = { 1, 5, 6, 2, 9 };
for (int n : ns) {
...
}
as something like this
for (int n : { 1, 5, 6, 2, 9 }) { // VC++11 rejects this form
...
}
tl;dr: Upgrade your compiler for great success.
Yeah, it's valid.
The definition of ranged-for in [C++11: 6.5.4/1] gives us two variants of syntax for this construct. One takes an expression on the right-hand-side of the :, and the other takes a braced-init-list.
Your braced-init-list deduces (through auto) to a std::initializer_list, which is handy because these things may be iterated over.
[..] for a range-based for statement of the form
for ( for-range-declaration : braced-init-list ) statement
let range-init be equivalent to the braced-init-list. In each case, a range-based for statement is equivalent to
{
auto && __range = range-init;
for ( auto __begin = begin-expr,
__end = end-expr;
__begin != __end;
++__begin ) {
for-range-declaration = *__begin;
statement
}
}
[..]
So, you are basically saying:
auto ns = { 1, 5, 6, 2, 9 };
for (int n : ns) {
// ...
}
(I haven't bothered with the universal reference here.)
which in turn is more-or-less equivalent to:
std::initializer_list<int> ns = { 1, 5, 6, 2, 9 };
for (int n : ns) {
// ...
}
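A quick way to check that deduction claim (a small sketch, not part of the original question):
#include <initializer_list>
#include <type_traits>

int main()
{
    auto ns = { 1, 5, 6, 2, 9 };
    static_assert(std::is_same<decltype(ns), std::initializer_list<int>>::value,
                  "a braced-init-list bound to auto deduces to std::initializer_list<int>");
    (void)ns;
}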
Now, GCC 4.8 supports this but, since "Visual Studio 11" is in fact Visual Studio 2012, you'll need to upgrade in order to catch up: initialiser lists were not supported at all until Visual Studio 2013.
It is possible to use this construct with an initializer list. It simply seems that the MS VC++ version you are using does not support it.
Here is an example
#include <iostream>
#include <initializer_list>
int main()
{
for (int n : { 1, 5, 6, 2, 9 }) std::cout << n << ' ';
std::cout << std::endl;
return 0;
}
You have to include the header <initializer_list> because the initializer list in the for statement is converted to a std::initializer_list<int>.
Given a number that is a power of two, what would be the quickest way to find which power of 2 it is (that is, the exponent)?
I'm not very skilled at mathematics, so I'm not sure how best to describe it. But the function would look similar to x = 2^y, where y is the output and x is the input. Here's a table of what it'd look like, if that helps explain it.
0 = f(1)
1 = f(2)
2 = f(4)
3 = f(8)
...
8 = f(256)
9 = f(512)
I've made a function that does this, but I fear it's not very efficient (or elegant for that matter). Would there be a simpler and more efficient way of doing this? I'm using this to compute what area of a texture is used to buffer how drawing is done, so it's called at least once for every drawn object. Here's the function I've made so far:
uint32 getThePowerOfTwo(uint32 value){
for(uint32 n = 0; n < 32; ++n){
if(value <= (1 << n)){
return n;
}
}
return 32; // should never be called
}
Building on woolstar's answer - I wonder if a binary search of a lookup table would be slightly faster? (and much nicer looking)...
int getThePowerOfTwo(int value) {
static constexpr int twos[] = {
1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7,
1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15,
1<<16, 1<<17, 1<<18, 1<<19, 1<<20, 1<<21, 1<<22, 1<<23,
1<<24, 1<<25, 1<<26, 1<<27, 1<<28, 1<<29, 1<<30, 1<<31
};
return std::lower_bound(std::begin(twos), std::end(twos), value) - std::begin(twos);
}
This operation is sufficiently popular that processor vendors have come up with hardware support for it. Check out find first set. Compiler vendors offer specific functions for this; unfortunately there appears to be no standard for how to name it. So if you need maximum performance you have to write compiler-dependent code:
# ifdef __GNUC__
return __builtin_ffs( x ) - 1; // GCC
#endif
#ifdef _MSC_VER
return CHAR_BIT * sizeof(x) - 1 - __lzcnt( x ); // Visual Studio: __lzcnt counts leading zero bits
#endif
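If a C++20 compiler is available (much newer than the intrinsics above), the <bit> header finally gives this operation a portable name; a small sketch:
#include <bit>     // C++20
#include <cstdint>

// for a power of two v, the count of trailing zero bits is exactly the exponent
constexpr int log2_of_pow2(std::uint32_t v)
{
    return std::countr_zero(v);
}

static_assert(log2_of_pow2(256u) == 8, "2^8 == 256");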
If the input value is always 2^n, where n is an integer, an optimal way to find n is to use a hash table with a perfect hash function. In that case the hash function for a 32-bit unsigned integer can be defined as value % 37 (37 is a prime modulus for which the powers 2^0 through 2^31 are all distinct, so there are no collisions).
template < size_t _Div >
std::array < uint8_t, _Div > build_hash()
{
std::array < uint8_t, _Div > hash_;
std::fill(hash_.begin(), hash_.end(), std::numeric_limits<uint8_t>::max());
for (size_t index_ = 0; index_ < 32; ++index_)
hash_[(1 << index_) % _Div] = index_;
return hash_;
}
uint8_t hash_log2(uint32_t value_)
{
static const std::array < uint8_t, 37 > hash_ = build_hash<37> ();
return hash_[value_%37];
}
Check
int main()
{
for (size_t index_ = 0; index_ < 32; ++index_)
assert(hash_log2(1 << index_) == index_);
}
Your version is just fine, but as you surmised, it's O(n), which means it takes one step through the loop for every bit. You can do better. To take it to the next step, try the equivalent of a divide and conquer:
unsigned int log2(unsigned int value)
{
unsigned int val = 0 ;
unsigned int mask= 0xffff0000 ;
unsigned int step= 16 ;
while ( value && step )
{
// if any bit is set in the upper half of the current window, the answer is at
// least `step`: record that and shift the upper half down to keep searching in it
if ( value & mask ) { val += step ; value >>= step ; }
mask >>= ( step + 1 ) / 2 ;
step /= 2 ;
}
return val ;
}
Since we're just hunting for the highest bit, we start out asking if any bits are on in the upper half of the word. If there are, we can throw away all the lower bits, else we just narrow the search down.
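A quick sanity check for the loop above (an illustrative test, assuming the function is named log2 as written and visible in the same translation unit):
#include <cassert>

int main()
{
    assert(log2(1u) == 0);
    assert(log2(8u) == 3);
    assert(log2(256u) == 8);
    assert(log2(1u << 31) == 31);
}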
Since the question was marked C++, here's a version using templates that tries to figure out the initial mask & step:
template <typename T>
T log2(T val)
{
T result = 0 ;
T step= ( 4 * sizeof( T ) ) ; // half the number of bits
T mask= ~ 0L - ( ( 1L << ( 4 * sizeof( T )) ) -1 ) ;
while ( val && step )
{
if ( val & mask ) { result += step ; val >>= step ; }
mask >>= ( step + 1) / 2 ;
step /= 2 ;
}
return result ;
}
While performance of either version is going to be a blip on a modern x86 architecture, this has come up for me in embedded solutions, and in the last case where I was solving a bit search very similar to this, even the O(log N) was too slow for the interrupt and we had to use a combo of divide and conquer plus table lookup to squeeze the last few cycles out.
If you KNOW that it is indeed a power of two (which is easy enough to verify),
try the variant below.
Full description here: http://sree.kotay.com/2007/04/shift-registers-and-de-bruijn-sequences_10.html
//table
static const int8 xs_KotayBits[32] = {
0, 1, 2, 16, 3, 6, 17, 21,
14, 4, 7, 9, 18, 11, 22, 26,
31, 15, 5, 20, 13, 8, 10, 25,
30, 19, 12, 24, 29, 23, 28, 27
};
//only works for powers of 2 inputs
static inline int32 xs_ILogPow2 (int32 v){
assert (v && ((v & (v-1)) == 0)); // parentheses needed: == binds tighter than &
//constant is binary 10 01010 11010 00110 01110 11111
return xs_KotayBits[(uint32(v)*uint32( 0x04ad19df ))>>27];
}