c++ transform and lambda - replace for loop - c++

I want to replace a for loop with std::transform. Since I have little experience with algorithms and lambda functions, I wonder if this is the correct way.
Original Code
for (size_t i=0; i < dataPhase.size(); ++i)
{
dataPhase[i] = fmod(dataPhase[i], pi*1.00001);
}
std::transform with a lambda
std::transform(dataPhase.begin(), dataPhase.end(), dataPhase.begin(),
[](double v){return fmod(v, pi*1.00001); }
);
Do I need a capture here?
What could I do to replace a for loop in such cases, where the index is used, as in this code:
const int halfsize = int(length/2);
for (size_t i=0; i < length; ++i)
{
axis[i] = int(i) - halfsize;
}
EDIT:
I would like to expand the question (if allowed).
Is it possible to replace the for loop in this case with something different?
for(std::vector<complex<double> >::size_type i = 0; i != data.size(); i++) {
dataAmplitude[i] = abs(data[i]);
dataPhase[i] = arg(data[i]);
}
Here the original vector is not modified; instead, its values are used to fill two different vectors.

Part 1)
You do not need a capture here because you are only using parameters (v) and globals (pi) in the lambda code.
A capture is only needed if the lambda has to access variables from the current scope (i.e. declared in your function). You can capture by reference (&) or by value (=).
Here is an example where a 'capture by reference' is needed because of 'result' being modified from within the lambda (but it also captures the 'searchValue'):
size_t count(const std::vector<char>& values, const char searchValue)
{
size_t result = 0;
std::for_each(values.begin(), values.end(), [&](const char& v) {
if (v == searchValue)
++result;
});
return result;
}
(In the real world, please use std::count_if() or even std::count().)
The compiler creates an unnamed functor (see this question) for each lambda. Its constructor takes the captured variables and stores them as member variables. So a 'capture by value' always uses the value the variable had at the time the lambda was defined.
Here is an example of a code the compiler could generate for the lambda we created earlier:
class UnnamedLambda
{
public:
UnnamedLambda(size_t& result_, const char& searchValue_)
: result(result_), searchValue (searchValue_)
{}
void operator()(const char& v)
{
// here is the code from the lambda expression
if (v == searchValue)
++result;
}
private:
size_t& result;
const char& searchValue;
};
and our function could be rewritten to:
size_t count(const std::vector<char>& values, const char searchValue)
{
size_t result = 0;
UnnamedLambda unnamedLambda(result, searchValue);
for(auto it = values.begin(); it != values.end(); ++it)
unnamedLambda(*it);
return result;
}
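To see the "capture by value keeps the value from the time of definition" point in action, here is a small self-contained sketch (the function name capture_demo is made up for illustration). The by-value lambda keeps the old value, while the by-reference lambda sees the later change:

```cpp
#include <utility>

// Returns {value seen by the by-value lambda, value seen by the by-reference lambda}
// after the captured variable has been changed.
std::pair<int, int> capture_demo()
{
    int x = 1;
    auto byValue = [x]() { return x; };       // copies x now (x == 1)
    auto byReference = [&x]() { return x; };  // refers to x itself
    x = 42;                                    // changed after both lambdas were created
    return {byValue(), byReference()};         // {1, 42}
}
```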
Part 2)
If you need the index just continue using a for loop.
std::transform allows processing single elements and therefore does not provide an index. There are some other algorithms, like std::accumulate, which work on an intermediate result, but I do not know of any that provides an index.
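For the axis example specifically, the index is only used as a counter, so std::iota from <numeric> (a different standard algorithm, not std::transform) can produce the same values without needing an index at all. A sketch, assuming axis is a std::vector<int> of size length; the helper name make_axis is hypothetical:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Fills axis with -halfsize, -halfsize + 1, ..., i.e. axis[i] == int(i) - halfsize,
// which is what the index-based loop in the question computes.
std::vector<int> make_axis(std::size_t length)
{
    const int halfsize = static_cast<int>(length / 2);
    std::vector<int> axis(length);
    std::iota(axis.begin(), axis.end(), -halfsize); // successive values start at -halfsize
    return axis;
}
```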

All of your examples can be transformed into a use of std::transform, with some additional objects doing the legwork (I use Boost here because it provides prior art for most of the classes needed).
// 1
for (size_t i=0; i < dataPhase.size(); ++i)
{
dataPhase[i] = fmod(dataPhase[i], pi*1.00001);
}
// 2
const int halfsize = int(length/2);
for (size_t i=0; i < length; ++i)
{
axis[i] = int(i) - halfsize;
}
// 3
for(std::vector<complex<double> >::size_type i = 0; i != data.size(); i++) {
dataAmplitude[i] = abs(data[i]);
dataPhase[i] = arg(data[i]);
}
As you correctly note, 1 becomes
std::transform(dataPhase.begin(), dataPhase.end(), dataPhase.begin(),
[](double v){return fmod(v, pi*1.00001); } );
2 needs a sequence of numbers, so I use boost::integer_range
const int halfsize = int(length/2);
// provides indexes 0, 1, ... , length - 1
boost::integer_range<int> interval = boost::irange(0, int(length));
std::transform(interval.begin(), interval.end(), axis.begin(),
[halfsize](int i) {return i - halfsize;});
3 involves a pair (2-tuple) of outputs, so I use boost::zip_iterator as a destination
std::transform(data.begin(), data.end(),
// turns a pair of iterators into an iterator of pairs
boost::make_zip_iterator(boost::make_tuple(dataAmplitude.begin(), dataPhase.begin())),
[](complex<double> d) { return boost::make_tuple(abs(d), arg(d)); });
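If Boost is not available, a standard-library-only sketch of example 3 is to run std::transform once per output vector. This reads data twice instead of once, which may or may not matter for your sizes; the function name split_polar is hypothetical and the outputs are assumed to be std::vector<double>:

```cpp
#include <algorithm>
#include <complex>
#include <vector>

// Splits complex samples into amplitude and phase vectors using only the
// standard library: one std::transform per output vector.
void split_polar(const std::vector<std::complex<double>>& data,
                 std::vector<double>& dataAmplitude,
                 std::vector<double>& dataPhase)
{
    dataAmplitude.resize(data.size());
    dataPhase.resize(data.size());
    std::transform(data.begin(), data.end(), dataAmplitude.begin(),
                   [](const std::complex<double>& d) { return std::abs(d); });
    std::transform(data.begin(), data.end(), dataPhase.begin(),
                   [](const std::complex<double>& d) { return std::arg(d); });
}
```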

Here are some examples of lambda captures:
[] Capture nothing
[&] Capture any referenced variable by reference
[=] Capture any referenced variable by making a copy
[=, &foo] Capture any referenced variable by making a copy, but capture variable foo by reference
[bar] Capture bar by making a copy; don't copy anything else
[this] Capture the this pointer of the enclosing class
Hence, if pi in your example is a local variable (not a macro or a global variable), you cannot leave the capture list empty ([]) but must use [pi] to capture it by copy (which is sensible for a double):
std::transform(dataPhase.begin(), dataPhase.end(), dataPhase.begin(),
[pi](double v){return fmod(v, pi*1.00001); }
);
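A self-contained sketch of that case, assuming pi is a local const double computed at run time with std::acos(-1.0) (an assumption; the question does not say how pi is defined). Because it is a local variable, the lambda has to capture it; wrap_phase is a made-up name:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// pi is a local here (assumption for illustration), so the lambda must capture it;
// [pi] copies the double into the closure object.
void wrap_phase(std::vector<double>& dataPhase)
{
    const double pi = std::acos(-1.0);
    std::transform(dataPhase.begin(), dataPhase.end(), dataPhase.begin(),
                   [pi](double v) { return std::fmod(v, pi * 1.00001); });
}
```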
For your second example there is no built-in std::transform providing the index. I think the better solution is to keep your for loop.
If you really want to use a lambda (as an exercise), you can use:
const int halfsize = int(length/2);
auto lambda=[&axis,halfsize](const int i){axis[i] = i - halfsize;};
for (size_t i=0; i < length; ++i)
{
lambda(i);
}
or
const int halfsize = int(length/2);
auto lambda=[halfsize](const int i){return i - halfsize;};
for (size_t i=0; i < length; ++i)
{
axis[i] = lambda(i);
}
It only depends on how you want to design your code.
Remark 1: it seems that you want to avoid "basic" for loops; however, they are not necessarily evil, especially if you want to use OpenMP to gain some performance (SIMD or multi-threading). For instance,
#pragma omp simd
for(auto& v_i :v) { // or, worse, std::transform
v_i = fmod(v_i, pi*1.00001);
}
is not supported and will not compile, at least with OpenMP versions before 5.0 (which added support for range-based for loops).
However
#pragma omp simd
for (size_t i=0; i < dataPhase.size(); ++i)
{
dataPhase[i] = fmod(dataPhase[i], pi*1.00001);
}
can be compiled with g++ -fopenmp ..., with a potential performance gain if SIMD can be used. Concerning multi-threading, one can argue that support for parallel execution of the STL algorithms is coming, but only with C++17.
Remark 2: not in C++, but in the D language you have a foreach statement that lets you optionally include the index:
foreach (e; [4, 5, 6]) { writeln(e); }
// 4 5 6
but
foreach (i, e; [4, 5, 6]) { writeln(i, ":", e); }
// 0:4 1:5 2:6

Related

A C++ lambda expression defined in a local variable to pass it to the parallel_for

I'm trying to implement conditional (size-dependent) processing of a std::vector. If it should be processed serially on a single thread, I simply call a function defined as a lambda expression within a plain 'for' loop, and it works OK. But when I tried to pass that lambda function as an argument to concurrency::parallel_for (after '#include <ppl.h>'), it fails to compile, saying there's no matching function to call for 'parallel_for'.
Array3<T> res = this->cloned();
size_t imax = res.getCount();
// my lambda expression to process elements of std::vector
std::function<void(size_t)> lambdaFun = [&](size_t i) {
res[i] *= rhs[i];
};
size_t i;
if (ll())
{
// ll job
concurrency::parallel_for(size_t(0), imax, lambdaFun); // fails to compile
}
else
{
// serial job
for (i = 0; i < imax; i++)
{
lambdaFun(i); // works great
}
}
However, when I pass a body of that lambda function to the 'concurrency::parallel_for' directly, it works:
concurrency::parallel_for(size_t(0), imax, [&](size_t i)
{
res[i] *= rhs[i]; // the same code as it was in the lambda works OK
}
);
I tried to change the type of a local variable storing the lambda function this way:
auto lambdaFun = [&](size_t i)
{
res[i] *= rhs[i];
};
But it did not help.
So the question is: how do I properly pass the lambda function defined in the local variable 'lambdaFun' to 'parallel_for' without explicitly typing its body in the loop?
Thanks to you all!
Finally I got it working. The counter size_t i was declared at the top of the member method rather than inside the loop, which is presumably why my lambda was treated as not appropriate for what parallel_for expects.
I also decided to keep the lambda's type as auto.
The right code is here:
Array3<T> res = this->cloned();
size_t imax = res.getCount();
auto lambdaFun = [&](size_t i) {
res[i] *= rhs[i];
};
if (res.ll())
{
// ll job
concurrency::parallel_for(size_t(0), imax, lambdaFun);
}
else
{
// serial job
for (size_t i = 0; i < imax; i++)
{
lambdaFun(i);
}
}

Find a combination of elements from different arrays

I recently started learning C++ and ran into problems with this task:
I am given 4 arrays of different lengths with different values.
vector<int> A = {1,2,3,4};
vector<int> B = {1,3,44};
vector<int> C = {1,23};
vector<int> D = {0,2,5,4};
I need to implement a function that goes through all possible combinations of the elements of these vectors and checks whether there are values a from array A, b from array B, c from array C and d from array D such that their sum is 0 (a+b+c+d=0).
I wrote such a program, but it outputs 1, although the desired combination does not exist.
using namespace std;
vector<int> test;
int sum (vector<int> v){
int sum_of_elements = 0;
for (int i = 0; i < v.size(); i++){
sum_of_elements += v[i];
}
return sum_of_elements;
}
bool abcd0(vector<int> A,vector<int> B,vector<int> C,vector<int> D){
for ( int ai = 0; ai <= A.size(); ai++){
test[0] = A[ai];
for ( int bi = 0; bi <= B.size(); bi++){
test[1] = B[bi];
for ( int ci = 0; ci <= C.size(); ci++){
test[2] = C[ci];
for ( int di = 0; di <= D.size(); di++){
test[3] = D[di];
if (sum (test) == 0){
return true;
}
}
}
}
}
}
I would be happy if you could explain what the problem is.
Vectors don't increase their size by themselves. You either need to construct them with the right size, resize them, or push_back elements (you can also insert, but vectors aren't very efficient at that). In your code you never add any element to test, so accessing any element, e.g. test[0] = A[ai];, causes undefined behavior.
Further, valid indices are [0, size()) (i.e. size() is excluded; it is not a valid index). Hence your loops access the input vectors out of bounds, causing undefined behavior again. The loop conditions should be for ( int ai = 0; ai < A.size(); ai++){.
Flowing off the end of a non-void function without returning a value is again undefined behavior. When your abcd0 does not find a combination that adds up to 0, it does not return anything.
After fixing those issues your code does produce the expected output: https://godbolt.org/z/KvW1nePMh.
However, I suggest that you...
not use global variables. They make the code difficult to reason about. For example, we need to see all your code to know whether you actually do resize test. If test were local to abcd0, we would only need to consider that function to know what happens to test.
read about Why is “using namespace std;” considered bad practice?
not pass parameters by value when you can pass them by const reference to avoid unnecessary copies.
use range-based for loops, which help to avoid mistakes with the bounds.
Trying to change not more than necessary, your code could look like this:
#include <vector>
#include <iostream>
int sum (const std::vector<int>& v){
int sum_of_elements = 0;
for (int i = 0; i < v.size(); i++){
sum_of_elements += v[i];
}
return sum_of_elements;
}
bool abcd0(const std::vector<int>& A,
const std::vector<int>& B,
const std::vector<int>& C,
const std::vector<int>& D){
for (const auto& a : A){
for (const auto& b : B){
for (const auto& c : C){
for (const auto& d : D){
if (sum ({a,b,c,d}) == 0){
return true;
}
}
}
}
}
return false;
}
int main() {
std::vector<int> A = {1,2,3,4};
std::vector<int> B = {1,3,44};
std::vector<int> C = {1,23};
std::vector<int> D = {0,2,5,4};
std::cout << abcd0(A,B,C,D);
}
Note that I removed the vector test completely. You don't need to construct it explicitly; you can pass a temporary to sum. sum could use std::accumulate, or you could simply add the four numbers directly in abcd0. I suppose this is an exercise, so let's leave it at that.
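Since std::accumulate is mentioned, here is what sum could look like with it; a minimal sketch keeping the same name and signature as the version above:

```cpp
#include <numeric>
#include <vector>

// Same contract as the hand-written sum(): adds up all elements, starting from 0.
int sum(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}
```

Note that the initial value 0 is an int, so the accumulation is done in int, matching the original loop.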
Edit: The answer written by #463035818_is_not_a_number is the answer you should refer to.
As mentioned in the comments by #Alan Birtles, there's nothing in that code that adds elements to test. Also, as mentioned in comments by #PaulMcKenzie, the loop conditions should be modified. Currently, they loop all the way up to the size of the vector, which is invalid (since the index runs from 0 to the size of the vector - 1). To implement the algorithm that you have in mind (as I inferred from your code), you can declare and initialise the vector all the way down in the 4th loop.
Here's the modified code:
int sum (vector<int> v){
int sum_of_elements = 0;
for (int i = 0; i < v.size(); i++){
sum_of_elements += v[i];
}
return sum_of_elements;
}
bool abcd0(vector<int> A,vector<int> B,vector<int> C,vector<int> D){
for ( int ai = 0; ai <A.size(); ai++){
for ( int bi = 0; bi <B.size(); bi++){
for ( int ci = 0; ci <C.size(); ci++){
for ( int di = 0; di <D.size(); di++){
vector<int> test = {A[ai], B[bi], C[ci], D[di]};
if (sum (test) == 0){
return true;
}
}
}
}
}
return false;
}
The algorithm is inefficient, though: it checks every one of the |A|·|B|·|C|·|D| combinations. You can try sorting the vectors first, then loop through the first two of them while using the two-pointer technique to check whether the desired sum is available from the remaining two vectors.
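A sketch of the sort-plus-two-pointer idea (an illustration, not code from this answer). Only C and D need to be sorted; for each pair (a, b), one pointer walks C upward and another walks D downward looking for c + d == -(a + b):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true if some a in A, b in B, c in C, d in D satisfy a + b + c + d == 0.
// C and D are taken by value so they can be sorted without touching the caller's data.
bool abcd0(const std::vector<int>& A, const std::vector<int>& B,
           std::vector<int> C, std::vector<int> D)
{
    std::sort(C.begin(), C.end());
    std::sort(D.begin(), D.end());
    for (int a : A) {
        for (int b : B) {
            const int target = -(a + b);  // we need c + d == target
            std::size_t i = 0;            // walks C from the smallest value up
            std::size_t j = D.size();     // walks D from the largest value down
            while (i < C.size() && j > 0) {
                const int s = C[i] + D[j - 1];
                if (s == target) return true;
                if (s < target) ++i;      // sum too small: take a bigger c
                else --j;                 // sum too large: take a smaller d
            }
        }
    }
    return false;
}
```

This does roughly |A|·|B|·(|C|+|D|) steps after the two sorts, instead of |A|·|B|·|C|·|D|.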
It looks to me like you're calling the function every time you want to check an array. Within the function you're initializing int sum_of_elements = 0;.
So at the first run, you're starting with int sum_of_elements = 0;.
It finds the element and increases sum_of_elements up to 1.
On the second run you're calling the function again, and it initializes int sum_of_elements = 0; again.
This is repeated every time you're checking the next array for the element.
Let me know if I understood that correctly (didn't run it, just skimmed a bit).

Why is std::string::append so much faster than push_back in this benchmark?

Given the following benchmark:
const char* payload = "abcdefghijk";
const std::size_t payload_len = 11;
const std::size_t payload_count = 1000;
static void StringAppend(benchmark::State& state) {
for (auto _ : state) {
std::string created_string;
created_string.reserve(payload_len * payload_count + 1);
for(int i = 0 ; i < payload_count; ++i) {
created_string.append(payload, payload_len);
}
benchmark::DoNotOptimize(created_string);
}
}
BENCHMARK(StringAppend);
static void StringBackInsert(benchmark::State& state) {
for (auto _ : state) {
std::string created_string;
created_string.reserve(payload_len * payload_count + 1);
auto inserter = std::back_inserter(created_string);
for(int i = 0 ; i < payload_count; ++i) {
for(std::size_t i = 0; i < payload_len; ++i) {
*inserter = payload[i];
++inserter;
}
}
benchmark::DoNotOptimize(created_string);
}
}
BENCHMARK(StringBackInsert);
static void StringPushBack(benchmark::State& state) {
for (auto _ : state) {
std::string created_string;
created_string.reserve(payload_len * payload_count + 1);
for(int i = 0 ; i < payload_count; ++i) {
for(std::size_t i = 0; i < payload_len; ++i) {
created_string.push_back(payload[i]);
}
}
benchmark::DoNotOptimize(created_string);
}
}
BENCHMARK(StringPushBack);
I get results on quick-bench (link below) that show a very dramatic difference.
Considering that all the required memory is allocated ahead of time, I'm having a lot of trouble buying into the idea that just doing the size vs capacity check represents essentially all of the cost here, unless maybe there's a massive number of load-hit-store or branch misprediction involved.
http://quick-bench.com/XQ9kepYFE1_dZD8vVaQwOUSSVoE
What I'd like to understand is:
Is there something specific to this setup that the compiler is using that will not necessarily apply in a real-world scenario?
If so, is there a way to rearrange this benchmark to be more representative?
Compilers don't always succeed at optimizing a byte-at-a-time loop into something non-terrible. You're comparing a known-length append against a whole inner loop that calls push_back.
push_back includes a size check, so using it this way checks and conditionally reallocates+copies after every byte pushed, which probably defeats the compiler's ability to optimize the loop.
But append only has to check once per whole chunk, and I assume clang can inline that 11-byte memcpy into a couple of loads/stores instead of 11 byte loads/byte stores.
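One way to test this explanation is to take the per-byte capacity check out of the inner loop: size the string once and then store bytes by index. The sketch below (function names hypothetical) builds the same 11,000-byte string both ways. Whether the index version actually gets vectorized depends on the compiler and standard library, so treat it as something to benchmark, not a guaranteed result:

```cpp
#include <cstddef>
#include <string>

// Sizes the string up front (this zero-fills it, which has its own cost), so the
// inner loop only stores bytes: no size/capacity check per byte.
std::string build_by_index(const char* payload, std::size_t payload_len,
                           std::size_t payload_count)
{
    std::string s(payload_len * payload_count, '\0');
    std::size_t pos = 0;
    for (std::size_t n = 0; n < payload_count; ++n) {
        for (std::size_t i = 0; i < payload_len; ++i) {
            s[pos++] = payload[i];
        }
    }
    return s;
}

// Reference version using append, as in the StringAppend benchmark.
std::string build_by_append(const char* payload, std::size_t payload_len,
                            std::size_t payload_count)
{
    std::string s;
    s.reserve(payload_len * payload_count);
    for (std::size_t n = 0; n < payload_count; ++n) {
        s.append(payload, payload_len);
    }
    return s;
}
```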

How to optimize reusing a large std::unordered_map as a temporary in a frequently called function?

Simplified question with a working example: I want to reuse a std::unordered_map (let's call it umap) multiple times, similar to the following dummy code (which does not do anything meaningful). How can I make this code run faster?
#include <iostream>
#include <unordered_map>
#include <time.h>
#include <cstdio> // for printf
unsigned size = 1000000;
void foo(){
std::unordered_map<int, double> umap;
umap.reserve(size);
for (int i = 0; i < size; i++) {
// in my real program: umap gets filled with meaningful data here
umap.emplace(i, i * 0.1);
}
// ... some code here which does something meaningful with umap
}
int main() {
clock_t t = clock();
for(int i = 0; i < 50; i++){
foo();
}
t = clock() - t;
printf ("%f s\n",((float)t)/CLOCKS_PER_SEC);
return 0;
}
In my original code, I want to store matrix entries in umap. In each call to foo, the key values start from 0 up to N, and N can be different in each call to foo, but there is an upper limit of 10M for indices. Also, values can be different (contrary to the dummy code here which is always i*0.1).
I tried making umap a non-local variable to avoid the repeated memory allocation of umap.reserve() in each call. This requires calling umap.clear() at the end of foo, but that turned out to be actually slower than using a local variable (I measured it).
I don't think there is any good way to accomplish what you're looking for directly, i.e. you can't clear the map without clearing the map. I suppose you could allocate a number of maps up front, use each of them only once as a "disposable map", and move on to the next map during your next call, but I doubt this would give you any overall speedup, since at the end of it all you'd have to clear all of them at once, and in any case it would be very RAM-intensive and cache-unfriendly (in modern CPUs, RAM access is very often the performance bottleneck, and therefore minimizing the number of cache misses is the way to maximize efficiency).
My suggestion would be that if clear-speed is so critical, you may need to move away from using unordered_map entirely, and instead use something simpler like a std::vector -- in that case you can simply keep a number-of-valid-items-in-the-vector integer, and "clearing" the vector is a matter of just setting the count back to zero. (Of course, that means you sacrifice unordered_map's quick-lookup properties, but perhaps you don't need them at this stage of your computation?)
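A minimal sketch of that idea, assuming the entries are (key, value) pairs and the maximum count is known up front; the class name ReusableEntries is hypothetical. "Clearing" only resets the count, so the storage is allocated once and reused:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: keeps its storage across calls and "clears" by resetting
// a count, so no memory is freed or reallocated between uses.
class ReusableEntries {
public:
    explicit ReusableEntries(std::size_t capacity) : items_(capacity) {}

    void clear() { count_ = 0; }  // O(1): the storage is kept

    // The caller must not push more entries than the capacity given to the constructor.
    void push(int key, double value) { items_[count_++] = std::make_pair(key, value); }

    std::size_t size() const { return count_; }
    const std::pair<int, double>& operator[](std::size_t i) const { return items_[i]; }

private:
    std::vector<std::pair<int, double>> items_;
    std::size_t count_ = 0;
};
```

Looking up a key here is a linear scan, which is exactly the quick-lookup trade-off mentioned above.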
A simple and effective way is to reuse the same container and its memory again and again by passing it by reference, as follows.
With this method, you avoid the repeated memory allocation in std::unordered_map::reserve and the deallocation in std::unordered_map::~unordered_map, both of which have complexity O(number of elements):
void foo(std::unordered_map<int, double>& umap)
{
std::size_t N = ...// set N here
for (int i = 0; i < N; ++i)
{
// overwrite umap[0], ..., umap[N-1]
// If umap does not have key=i, then it is inserted.
umap[i] = i*0.1;
}
// do something and not access to umap[N], ..., umap[size-1] !
}
The caller side would be as follows:
std::unordered_map<int,double> umap;
umap.reserve(size);
for(int i=0; i<50; ++i){
foo(umap);
}
But since your keys are always the contiguous integers 0, 1, ..., N-1, I think a std::vector, which lets you avoid hash calculations, would be preferable for storing the values umap[0], ..., umap[N-1]:
void foo(std::vector<double>& vec)
{
int N = ...// set N here
for(int i = 0; i<N; ++i)
{
// overwrite vec[0], ..., vec[N-1]
vec[i] = i*0.1;
}
// do something and not access to vec[N], ..., vec[size-1] !
}
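The vector version relies on vec already having at least N elements, because vec[i] never grows a vector. A sketch of a matching caller, assuming N is passed to foo as a parameter (hypothetical; the answer leaves it as "...") and that the upper limit on N is known, so the vector can be allocated once:

```cpp
#include <cstddef>
#include <vector>

// Writes vec[0..N-1]; vec must already have at least N elements,
// because operator[] never grows a vector.
void foo(std::vector<double>& vec, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i) {
        vec[i] = i * 0.1;
    }
    // ... use only vec[0], ..., vec[N-1] here
}

// Allocates once for the upper limit on N, then reuses the same storage on every call.
void run(std::size_t maxN, std::size_t calls)
{
    std::vector<double> vec(maxN);  // single allocation for all calls
    for (std::size_t c = 0; c < calls; ++c) {
        const std::size_t N = maxN; // in real code N differs per call, with N <= maxN
        foo(vec, N);
    }
}
```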
Have you tried to avoid all memory allocation by using a simple array? You've said above that you know the maximum size of umap over all calls to foo():
#include <iostream>
#include <unordered_map>
#include <time.h>
#include <cstdio> // for printf
constexpr int size = 1000000;
double af[size];
void foo(int N) {
// assert(N<=size);
for (int i = 0; i < N; i++) {
af[i] = i;
}
// ... af
}
int main() {
clock_t t = clock();
for(int i = 0; i < 50; i++){
foo(size /* or some other N<=size */);
}
t = clock() - t;
printf ("%f s\n",((float)t)/CLOCKS_PER_SEC);
return 0;
}
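As I suggested in the comments, closed hashing would be better for your use case. Here's a quick-and-dirty closed hash map with a fixed hashtable size that you could experiment with:
#include <array>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>
// Note: there is no load-factor check; keep the number of keys well below size.
// With the default size the table is about 4 MB, so avoid putting it on the stack.
template<class Key, class T, size_t size = 1000003, class Hash = std::hash<Key>>
class closed_hash_map {
public:
typedef std::pair<const Key, T> value_type;
typedef typename std::vector<value_type>::iterator iterator;
private:
std::array<int, size> hashtable{}; // slot holds (index into data) + 1; 0 means empty
std::vector<value_type> data;
public: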
iterator begin() { return data.begin(); }
iterator end() { return data.end(); }
iterator find(const Key &k) {
size_t h = Hash()(k) % size;
while (hashtable[h]) {
if (data[hashtable[h]-1].first == k)
return data.begin() + (hashtable[h] - 1);
if (++h == size) h = 0; }
return data.end(); }
std::pair<iterator, bool> insert(const value_type& obj) {
size_t h = Hash()(obj.first) % size;
while (hashtable[h]) {
if (data[hashtable[h]-1].first == obj.first)
return std::make_pair(data.begin() + (hashtable[h] - 1), false);
if (++h == size) h = 0; }
data.emplace_back(obj);
hashtable[h] = data.size();
return std::make_pair(data.end() - 1, true); }
void clear() {
data.clear();
hashtable.fill(0); }
};
It can be made more flexible by dynamically resizing the hashtable on demand when appropriate, and more efficient by using Robin Hood replacement.

Declaring a function that takes generic input and output iterators

I would like to modify this function so that it mimics standard library algorithms by taking input iterators and writing to an output iterator instead of what it's currently doing.
Here is the code:
template <class T>
std::vector<std::vector<T>> find_combinations(std::vector<std::vector<T>> v) {
unsigned int n = 1;
for_each(v.begin(), v.end(), [&](std::vector<T> &a){ n *= a.size(); });
std::vector<std::vector<T>> combinations(n, std::vector<T>(v.size()));
for (unsigned int i = 1; i <= n; ++i) {
unsigned int rate = n;
for (unsigned int j = 0; j != v.size(); ++j) {
combinations[i-1][j] = v[j].front();
rate /= v[j].size();
if (i % rate == 0) std::rotate(v[j].begin(), v[j].begin() + 1, v[j].end());
}
}
return combinations;
}
How it's used:
std::vector<std::vector<int>> input = { { 1, 3 }, { 6, 8 } };
std::vector<std::vector<int>> result = find_combinations(input);
My problem is writing the declaration. I'm assuming that it involves iterator traits but I haven't been able to figure out the syntax.
First of all, don't pass vectors by value. A return value may be optimized and moved (even if it's not C++11), but for an input parameter it's hard for the compiler to know whether it can just pass a reference.
Second, you can't initialize a vector of vectors like that.
Now, for the syntax, just use:
std::vector<std::vector<T>> find_combinations(std::vector<std::vector<T>>& v) {
}
It will work fine.
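The answer above keeps the vector-based signature. For the iterator-based declaration the question asks about, one possible shape is sketched here under assumptions: the input range yields containers, each combination is written out as a std::vector of the element type (found through std::iterator_traits), and an odometer-style index replaces the std::rotate calls while producing combinations in the same order as the original function. This is an illustration, not a standard-library API:

```cpp
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// [first, last) is a range of containers; every combination that picks one element
// from each container is written to out as a std::vector<T>.
template <class InputIt, class OutputIt>
OutputIt find_combinations(InputIt first, InputIt last, OutputIt out)
{
    using Container = typename std::iterator_traits<InputIt>::value_type;
    using T = typename Container::value_type;

    // Copy the input once so it can be indexed (input iterators are single-pass).
    std::vector<std::vector<T>> pools;
    for (; first != last; ++first) {
        const auto& c = *first;
        pools.emplace_back(std::begin(c), std::end(c));
    }
    for (const auto& p : pools)
        if (p.empty()) return out;                  // an empty pool means no combinations

    std::vector<std::size_t> idx(pools.size(), 0);  // current position in each pool
    while (true) {
        std::vector<T> combo;
        combo.reserve(pools.size());
        for (std::size_t j = 0; j < pools.size(); ++j)
            combo.push_back(pools[j][idx[j]]);
        *out = std::move(combo);
        ++out;

        // Advance like an odometer: the last position changes fastest.
        std::size_t j = pools.size();
        while (j > 0 && ++idx[j - 1] == pools[j - 1].size()) {
            idx[j - 1] = 0;
            --j;
        }
        if (j == 0) return out;                     // every position wrapped: done
    }
}
```

Usage mirrors the question: find_combinations(input.begin(), input.end(), std::back_inserter(result)); with input {{1, 3}, {6, 8}} this yields {1, 6}, {1, 8}, {3, 6}, {3, 8}.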