Is it "safe" to give macros names as arguments to other macros to simulate higher order functions?
I.e. where should I look to not shoot myself in the foot?
Here are some snippets:
#define foreach_even(ii, instr) for(int ii = 0; ii < 100; ii += 2) { instr; }
#define foreach_odd(ii, instr) for(int ii = 1; ii < 100; ii += 2) { instr; }
#define sum(foreach_loop, accu) \
foreach_loop(ii, {accu += ii});
int acc = 0;
sum(foreach_even, acc);
sum(foreach_odd, acc);
What about partial application: can I do that?
#define foreach(ii, start, end, step, instr) \
for(int ii = start; ii < end; ii += step) { instr; }
#define foreach_even(ii, instr) foreach(ii, 0, 100, 2, instr)
#define foreach_odd(ii, instr) foreach(ii, 1, 100, 2, instr)
#define sum(foreach_loop, accu) \
foreach_loop(ii, {accu += ii});
int acc = 0;
sum(foreach_even, acc);
sum(foreach_odd, acc);
And can I define a macro inside a macro?
#define apply_first(new_macro, macro, arg) #define new_macro(x) macro(arg,x)
If you're into using the preprocessor as much as possible, you may want to try Boost.Preprocessor.
But be aware that it is not safe to do so. Commas, for instance, cause a great number of problems when using the preprocessor. Don't forget that the preprocessor does not understand (or even try to understand) any of the code it is generating.
My basic advice is "don't do it", or "do it as cautiously as possible".
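To make the comma problem concrete, here is a minimal sketch (DECLARE and DECLARE_V are made-up names for illustration): a comma in a template argument list splits the macro argument, and a C99/C++11 variadic macro is one common workaround:
#include <map>

#define DECLARE(type, name) type name

// DECLARE(std::map<int, int>, m);   // error: the macro sees 3 arguments, not 2

#define DECLARE_V(name, ...) __VA_ARGS__ name
DECLARE_V(m, std::map<int, int>);    // OK: the comma is swallowed by __VA_ARGS__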
I've implemented a rotten little unit-testing framework entirely in the C preprocessor. Several dozen macros, with lots of macro-as-an-argument-to-another-macro type stuff.
This kind of thing is not "safe" in a best-practices sense of the word. There are subtle and very powerful ways to shoot yourself in the foot. The unit-testing project is a toy that got out of hand.
I don't know if you can nest macro definitions. I doubt it, but I'll go try... gcc doesn't like it, and responds with:
nested_macro.cc:8: error: stray '#' in program
nested_macro.cc:3: error: expected constructor, destructor, or type conversion before '(' token
nested_macro.cc:3: error: expected declaration before '}' token
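That failure is by design: a macro expansion can never produce another preprocessing directive, and # in a replacement list must stringize a parameter. The usual workaround is to skip the generated #define entirely and defer the call instead; a minimal sketch (apply_first and add here are illustrative, not the snippet from the question):
// Instead of generating "#define new_macro(x) macro(arg, x)",
// bind the argument at the point of use:
#define apply_first(macro, arg, x) macro(arg, x)
#define add(a, b) ((a) + (b))

int y = apply_first(add, 1, 2);  // expands to ((1) + (2))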
Self plug: If you're interested the unit testing framework can be found at https://sourceforge.net/projects/dut/
Related
I am writing Fortran code using a task-based paradigm.
I use a DAG to express the dependencies.
Using OpenMP 4.5, I can use the depend clause, which takes a dependence-type and a list of dependencies.
This mechanism works well when you know the number of dependencies explicitly.
However, in my case, I want to create tasks that have a list of dependencies varying from 1 to n elements.
Reading the documentation OpenMP-4.5_doc, I have not found any mechanism that allows me to provide a variable list of dependencies.
Let us take an example.
Consider the computation of traffic. A road has as dependencies the computed state of its predecessor road(s) (I hope this is clear enough).
Therefore, the traffic of a road is computed once the traffic of all its predecessor roads has been computed.
Using Fortran style, we have the following sketch of code:
!road is a structure such that
! type(road) :: road%dep(:)
! integer :: traffic
type(road) :: road
!$omp task shared(road) &
!$omp& depend(in: road%dep) depend(inout: road)
call compute_traffic(road)
!$omp end task
What I am trying to do is to use the field %dep as a list of dependencies for OpenMP.
Alternatively, we can consider that %dep has a different type: a list of pointers that point to the roads concerned.
To go beyond this illustration: I work on a sparse direct solver, more precisely on the Cholesky factorization and its application. Using a multifrontal approach, you get many small dense blocks. The factorization, as well as the solve, is split into two subroutines: first the factorization (or the solve) of the diagonal block, then the update of the off-diagonal blocks. The update of a dense block needs the updates of all previous dense blocks that share the same rows.
The fact is that I have a task to update an off-diagonal block that can depend on more than one block, and obviously the number of dependencies is related to the pattern (the structure) of the input matrix being factored. Therefore it is not possible to determine the number of dependencies statically. That is why I am trying to give a list of blocks in the depend clause.
The feature you are looking for has been proposed under the name multiple dependency by Vidal et al. in the International Workshop on OpenMP, 2015 (see here for an open access version).
As far as I know, this feature has not found its way into OpenMP tasks (yet?), but you could use OmpSs, the OpenMP forerunner where this proposal (and many more) was implemented.
The ugly workaround otherwise, since your dependency count needs to be known at compile time, is to write (or generate) a switch (or rather a SELECT CASE in Fortran) on the number of dependencies, each case with its own separate pragma.
I don't know a whole lot about Fortran, I'm afraid, but in C you can get a long way with X-macros and _Pragma(). I think GNU Fortran uses the C preprocessor, so hopefully you can transpose some code I once used (otherwise you're probably going to have to write all your cases by hand):
// L(n, X) = Ln(X) is a list of n macro expansions of X
#define L_EVALN(N, X) L ## N(X)
#define L(N, X) L_EVALN(N, X)
#define L1(X) X(1, b)
#define L2(X) L1(X) X(2, c)
#define L3(X) L2(X) X(3, d)
#define L4(X) L3(X) X(4, e)
#define L5(X) L4(X) X(5, f)
#define L6(X) L5(X) X(6, g)
#define L7(X) L6(X) X(7, h)
#define L8(X) L7(X) X(8, i)
#define L9(X) L8(X) X(9, j)
#define L10(X) L9(X) X(10, k)
#define L11(X) L10(X) X(11, l)
#define L12(X) L11(X) X(12, m)
#define L13(X) L12(X) X(13, n)
// Expand x, stringify, and put inside _Pragma()
#define EVAL_PRAGMA(x) _Pragma (#x)
#define DO_PRAGMA(x) EVAL_PRAGMA(x)
// X-macro to define dependencies on b{id} (size n{id})
#define OMP_DEPS(num, id) , [n_ ## id]b_ ## id
// X-macro to define symbols b{id} n{id} for neighbour #num
#define DEFINE_DEPS(num, id) \
double *b_ ## id = b[num]; \
int n_ ## id = nb[num];
// Calls each X-macros N times
#define N_OMP_DEPS(N) L(N, OMP_DEPS)
#define N_DEFINE_DEPS(N) L(N, DEFINE_DEPS)
// defines the base task with 1 dependency on b_a == *b,
// to which we can add any number of supplementary dependencies
#define OMP_TASK(EXTRA) DO_PRAGMA(omp task depend(in: [n_a]b_a EXTRA))
// if there are N neighbours, define N deps and depend on them
#define CASE(N) case N: \
{ \
N_DEFINE_DEPS(N) \
OMP_TASK(N_OMP_DEPS(N)) \
{ \
for (int i = 0; i < n; i++) b[i] = ... ; \
} \
} break;
int task(int n, int *nb, double **b)
{
double *b_a = b[0];
int n_a = nb[0];
switch(n)
{
CASE(1)
CASE(2)
CASE(3)
CASE(4)
}
}
That would generate the following code (if you prettify it):
int task(int n, int *nb, double **b)
{
double *b_a = b[0];
int n_a = nb[0];
switch (n)
{
case 1:
{
double *b_b = b[1];
int n_b = nb[1];
#pragma omp task depend(in: [n_a]b_a , [n_b]b_b)
{
for (int i = 0; i < n; i++)
b[i] = ... ;
}
} break;
case 2:
{
double *b_b = b[1];
int n_b = nb[1];
double *b_c = b[2];
int n_c = nb[2];
#pragma omp task depend(in: [n_a]b_a , [n_b]b_b , [n_c]b_c)
{
for (int i = 0; i < n; i++)
b[i] = ... ;
}
} break;
case 3:
{
double *b_b = b[1];
int n_b = nb[1];
double *b_c = b[2];
int n_c = nb[2];
double *b_d = b[3];
int n_d = nb[3];
#pragma omp task depend(in: [n_a]b_a , [n_b]b_b , [n_c]b_c , [n_d]b_d)
{
for (int i = 0; i < n; i++)
b[i] = ... ;
}
} break;
case 4:
{
double *b_b = b[1];
int n_b = nb[1];
double *b_c = b[2];
int n_c = nb[2];
double *b_d = b[3];
int n_d = nb[3];
double *b_e = b[4];
int n_e = nb[4];
#pragma omp task depend(in: [n_a]b_a , [n_b]b_b , [n_c]b_c , [n_d]b_d , [n_e]b_e)
{
for (int i = 0; i < n; i++)
b[i] = ... ;
}
} break;
}
}
As horrendous as this is, it's a workaround and its main perk is: it works.
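For later readers: OpenMP 5.0 (released after this question) added an iterator modifier to the depend clause, which expresses exactly this kind of variable-length dependency list. A rough, untested sketch of what that looks like for the C code above (not available in OpenMP 4.5, which the question targets):
// OpenMP 5.0 only: one task depending on all n neighbour blocks at once,
// with no switch needed
#pragma omp task depend(iterator(j = 0:n), in: b[j][0])
{
    for (int i = 0; i < n; i++) { /* ... update ... */ }
}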
I have a #define which is used within a member function. Because of this #define, the member-function definition is no longer found by VS2015 from the declaration. The project compiles and runs just fine, so no problem there.
It does, however, break VS2015's ability to jump between the declaration and the definition.
It can be worked around by expanding the #define by hand in the source, but can this be solved without removing the #define?
File: cFoo.h
class cFoo
{
public:
int Bits;
void Member();
};
File: cFoo.cpp
#include "cFoo.h"
#define SWITCH( x ) for ( int bit = 1; x >= bit; bit *= 2) if (x & bit) switch (bit)
void cFoo::Member()
{
SWITCH( Bits )
{
case 1: break;
case 2: break;
default: break;
}
}
I would advise against using such constructs. They are rather counterintuitive and difficult to understand.
Possible problems/difficulties:
a switch with case breaks suggests that only one case is executed, but your macro logic hides a loop.
the default case is executed multiple times, depending on the highest set bit.
using a signed int as a bit set -> use unsigned instead; it is less prone to implementation-defined behavior.
it is possibly slow because of the loop (I do not know whether the compiler is able to unroll and optimize it).
the name suggests that bit numbers are expected, but the cases have to be powers of two.
Your whole construct is not more compact than a simple sequence of ifs:
if (bits & 1) {
}
if (bits & 1024) {
}
but maybe you want to test bit numbers instead:
inline bool isBitSet(uint32_t i_bitset, uint32_t i_bit) { return i_bitset & (1u << i_bit); }
if (isBitSet(bits, 0)) {
}
if (isBitSet(bits, 10)) {
}
You can try doing this by completing the switch statement within the #define macro:
Foo.h
#ifndef FOO_H
#define FOO_H
#ifndef SWITCH
#define SWITCH(x) \
do { \
for ( int bit = 1; (x) >= bit; bit *=2 ) { \
if ( (x) & bit ) { \
switch( bit ) { \
case 1: break; \
case 2: break; \
default: break; \
} \
} \
} \
} while(0)
#endif
class Foo {
public:
int Bits;
void Member();
};
#endif // !FOO_H
Foo.cpp
#include "Foo.h"
void Foo::Member() {
SWITCH( this->Bits );
}
When I was trying the #define macro as you had it, Foo::Member()
was not being recognized by IntelliSense...
However, even with it not being recognized, I was able to build, compile and run it like this:
#ifndef SWITCH_A
#define SWITCH_A(x) \
for ( int bit = 1; (x) >= bit; bit *= 2 ) \
if ( (x) & bit ) \
switch (bit)
#endif
For some reason, MS Visual Studio was being very picky with it, meaning I was having a hard time getting it to compile; it even complained about the spacing, giving me compiler errors about a missing { or a ; needed before..., etc. Once I got all the spacing down properly it did eventually compile and run, but IntelliSense was not able to see that Foo::Member() was defined. When I tried it in the original manner shown above, IntelliSense had no problem seeing that Foo::Member() was defined. I don't know if it's a precompiler bug, an MS Visual Studio bug, etc.; I just hope this helps you.
I'm a competitive programmer, and I've been asking myself if there is any shorter, more elegant way of writing for(int i=0; i<n; ++i). I can only use standard C++, no other libraries.
In C++ competitions there is a well-known set of macros (don't use them in commercial projects). You also asked for a more elegant solution; this one is well known, but certainly not more elegant.
For example, read this TopCoder article:
#define REP(x, n) for(int x = 0; x < (n); ++x)
then in code you can simply write
REP(i,n){
}
One basic complete header I found:
#include <cstdio>
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
using namespace std;
typedef vector<int> VI;
typedef long long LL;
#define FOR(x, b, e) for(int x = b; x <= (e); ++x)
#define FORD(x, b, e) for(int x = b; x >= (e); --x)
#define REP(x, n) for(int x = 0; x < (n); ++x)
#define VAR(v, n) typeof(n) v = (n)
#define ALL(c) (c).begin(), (c).end()
#define SIZE(x) ((int)(x).size())
#define FOREACH(i, c) for(VAR(i, (c).begin()); i != (c).end(); ++i)
#define PB push_back
#define ST first
#define ND second
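A quick sketch of how a few of these read in practice (using the header above):
VI v = {3, 1, 2};
sort(ALL(v));                       // sort(v.begin(), v.end())
FOR(i, 1, 10) printf("%d ", i);     // i runs 1..10 inclusive
REP(i, SIZE(v)) printf("%d ", v[i]);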
Without running timed tests, I would assume that both:
for(int i=0; i<n; ++i)
and:
int i=0;
while (i<n)
{
i++;
}
would be extraordinarily close in timing. Perhaps use timestamps within a program that runs both types of loops, and see what the overall time per loop is for each type.
These are the fundamental looping structures of C/C++, so I do not think there would be anything that runs faster (but I'm willing to be proven wrong if I learn something new).
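If you want to measure it, a minimal timing sketch with std::chrono (the loop bodies are placeholders; volatile keeps the compiler from optimizing the loops away entirely):
#include <chrono>
#include <cstdio>

int main()
{
    const int n = 100000000;
    volatile long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) sum += i;        // for version
    auto t1 = std::chrono::steady_clock::now();

    int i = 0;
    while (i < n) { sum += i; ++i; }             // while version
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("for:   %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("while: %lld us\n", (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}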
Seeing as you didn't specify whether you need to use i, how about[1]:
int i=n+1; while(--i);
It's shorter!
[1] not proven to be correct.
I'm a competitive programmer too. This answer may be off-topic, but I think it will provide some useful ideas.
Personally, I think you shouldn't focus on these types of questions. I don't think there's a big difference between writing for (int i = 1; i <= n; ++i) and FOR(i, 1, n). The first one is obviously shorter and takes less time to type, but once you get to a high enough level, problem-solving skills matter much much more than typing speed. Don't trust me? See tourist's code.
I think what you should focus on is improving your problem-solving skills. The best way is to solve as many problems as possible. Doing so will also increase your typing speed as a side effect.
In the following two code snippets, is there actually any difference in the speed of compiling or running?
for (int i = 0; i < 50; i++)
{
if (i % 3 == 0)
continue;
printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
if (i % 3 != 0)
printf("Yay");
}
Personally, in situations where there is a lot more than a print statement, I've been using the first method to reduce the amount of indentation for the contained code. I've been wondering for a while, so I found it about time to ask whether it actually has an effect other than a visual one.
Reply to Alf (I couldn't get code working in the comments...):
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
if object position is static
continue
deal with velocity and jazz
compared with
for each object
if object position is not static
deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has the higher probability, you can use the GCC likely/unlikely macros.
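Those macros are conventionally defined around the real GCC builtin __builtin_expect, roughly like this (the names likely/unlikely are convention, not part of GCC itself):
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

for (int i = 0; i < 50; i++)
{
    if (likely(i % 3 != 0))
        printf("Yay");
}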
How about getting rid of the check altogether? For t = 0, 1, 2, ..., the expression t + (t >> 1) + 1 yields 1, 2, 4, 5, 7, 8, ..., i.e. exactly the 33 non-multiples of 3 below 50:
for (int t = 0; t < 33; t++)
{
int i = t + (t >> 1) + 1;
printf("%d\n", i);
}
C++11 includes very cool new features, but I can't find many examples of how to parallelize a for loop.
So my very naive question is: how do you parallelize a simple for loop (like you would with "omp parallel for") using std::thread? (I'm looking for an example.)
Thank you very much.
std::thread is not necessarily meant to parallelize loops. It is meant to be the low-level abstraction on top of which constructs like a parallel_for algorithm can be built. If you want to parallelize your loops, you should either write a parallel_for algorithm yourself or use existing libraries which offer task-based parallelism.
The following example shows how you could parallelize a simple loop, but on the other hand it also shows the disadvantages, like the missing load balancing and the complexity for a simple loop.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

typedef std::vector<int> container;
typedef container::iterator iter;
container v(100, 1);
auto worker = [] (iter begin, iter end) {
for(auto it = begin; it != end; ++it) {
*it *= 2;
}
};
// serial
worker(std::begin(v), std::end(v));
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 200
// parallel
std::vector<std::thread> threads(8);
const int grainsize = v.size() / 8;
auto work_iter = std::begin(v);
for(auto it = std::begin(threads); it != std::end(threads) - 1; ++it) {
*it = std::thread(worker, work_iter, work_iter + grainsize);
work_iter += grainsize;
}
threads.back() = std::thread(worker, work_iter, std::end(v));
for(auto&& i : threads) {
i.join();
}
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 400
Using a library which offers a parallel_for template, it can be simplified to
parallel_for(std::begin(v), std::end(v), worker);
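For reference, such a parallel_for can itself be written in a few lines on top of std::thread; a minimal sketch under the same assumptions as above (random access iterators, no load balancing):
#include <thread>
#include <vector>

template <typename Iter, typename Worker>
void parallel_for(Iter begin, Iter end, Worker worker,
                  unsigned nthreads = std::thread::hardware_concurrency())
{
    if (nthreads < 2) { worker(begin, end); return; }   // fall back to serial
    const auto grainsize = (end - begin) / nthreads;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t + 1 < nthreads; ++t) {
        threads.emplace_back(worker, begin, begin + grainsize);
        begin += grainsize;
    }
    worker(begin, end);              // last chunk (plus any remainder) runs on this thread
    for (auto &th : threads) th.join();
}
With the worker lambda from the example above, parallel_for(std::begin(v), std::end(v), worker); then behaves like the hand-rolled version.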
Well, obviously it depends on what your loop does, how you choose to parallelize, and how you manage the threads' lifetimes.
I'm reading the book on the C++11 standard threading library by one of the boost.thread maintainers (who also wrote Just::Thread), and I can see that "it depends".
Now, to give you an idea of the basics using the new standard threading, I would recommend reading the book, as it gives plenty of examples.
Also, take a look at http://www.justsoftwaresolutions.co.uk/threading/ and https://stackoverflow.com/questions/415994/boost-thread-tutorials
I can't provide a C++11-specific answer since we're still mostly using pthreads. But, as a language-agnostic answer, you parallelize something by setting it up to run in a separate function (the thread function).
In other words, you have a function like:
def processArraySegment (threadData):
arrayAddr = threadData->arrayAddr
startIdx = threadData->startIdx
endIdx = threadData->endIdx
for i = startIdx to endIdx:
doSomethingWith (arrayAddr[i])
exitThread()
and, in your main code, you can process the array in two chunks:
int xyzzy[100]
threadData->arrayAddr = xyzzy
threadData->startIdx = 0
threadData->endIdx = 49
threadData->done = false
tid1 = startThread (processArraySegment, threadData)
// caveat coder: see below.
threadData->arrayAddr = xyzzy
threadData->startIdx = 50
threadData->endIdx = 99
threadData->done = false
tid2 = startThread (processArraySegment, threadData)
waitForThreadExit (tid1)
waitForThreadExit (tid2)
(keeping in mind the caveat that you should ensure thread 1 has loaded the data into its local storage before the main thread starts modifying it for thread 2, possibly with a mutex or by using an array of structures, one per thread).
In other words, it's rarely a simple matter of just modifying a for loop so that it runs in parallel, though that would be nice. Something like:
for {threads=10} ({i} = 0; {i} < ARR_SZ; {i}++)
array[{i}] = array[{i}] + 1;
Instead, it requires a bit of rearranging your code to take advantage of threads.
And, of course, you have to ensure that it makes sense for the data to be processed in parallel. If you're setting each array element to the previous one plus 1, no amount of parallel processing will help, simply because you have to wait for the previous element to be modified first.
This particular example above simply uses an argument passed to the thread function to specify which part of the array it should process. The thread function itself contains the loop to do the work.
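Since the question asks about std::thread specifically, here is a rough C++11 translation of the pseudocode above (a sketch; doSomethingWith is a stand-in for the real work, and passing the Segment struct by value sidesteps the shared-threadData caveat):
#include <thread>

void doSomethingWith(int &x) { x += 1; }       // stand-in for the work in the pseudocode

struct Segment { int *arr; int start; int end; };

void processArraySegment(Segment seg)          // by value: each thread gets its own copy,
{                                              // so there is no shared threadData to race on
    for (int i = seg.start; i <= seg.end; ++i)
        doSomethingWith(seg.arr[i]);
}

int main()
{
    int xyzzy[100] = {};
    std::thread t1(processArraySegment, Segment{xyzzy, 0, 49});
    std::thread t2(processArraySegment, Segment{xyzzy, 50, 99});
    t1.join();
    t2.join();
}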
Using this class you can do it as:
Range based loop (read and write)
pforeach(auto &val, container) {
val = sin(val);
};
Index based for-loop
auto new_container = container;
pfor(size_t i, 0, container.size()) {
new_container[i] = sin(container[i]);
};
Define a macro using std::thread and a lambda expression (this needs #include <thread> and #include <vector>):
#ifndef PARALLEL_FOR
#define PARALLEL_FOR(INT_LOOP_BEGIN_INCLUSIVE, INT_LOOP_END_EXCLUSIVE, I, O) \
{ \
    int LOOP_LIMIT = INT_LOOP_END_EXCLUSIVE - INT_LOOP_BEGIN_INCLUSIVE; \
    /* std::vector instead of a variable-length array, which is not standard C++ */ \
    std::vector<std::thread> threads(LOOP_LIMIT); \
    auto fParallelLoop = [&](int I){ O; }; \
    for(int i = 0; i < LOOP_LIMIT; i++) \
    { \
        threads[i] = std::thread(fParallelLoop, i + INT_LOOP_BEGIN_INCLUSIVE); \
    } \
    for(int i = 0; i < LOOP_LIMIT; i++) \
    { \
        threads[i].join(); \
    } \
}
#endif
usage:
int aaa=0; // std::atomic<int> aaa;
PARALLEL_FOR(0,90,i,
{
aaa+=i;
});
It's ugly, but it works (I mean the multi-threading part, not the non-atomic incrementing).
AFAIK the simplest way to parallelize a loop, if you are sure that no concurrent accesses are possible, is to use OpenMP.
It is supported by all major compilers except LLVM (as of August 2013).
Example:
for(int i = 0; i < n; ++i)
{
tab[i] *= 2;
tab2[i] /= 2;
tab3[i] += tab[i] - tab2[i];
}
This can be parallelized very easily, like this:
#pragma omp parallel for
for(int i = 0; i < n; ++i)
{
tab[i] *= 2;
tab2[i] /= 2;
tab3[i] += tab[i] - tab2[i];
}
However, be aware that this is only efficient with a large number of values.
If you use g++, another very C++11-ish way of doing it would be to use a lambda and for_each, with the GNU parallel extensions (which can use OpenMP behind the scenes):
__gnu_parallel::for_each(std::begin(tab), std::end(tab), [&](int &value)
{
    stuff_of_your_loop(value);
});
However, for_each is mainly intended for arrays, vectors, etc.
But you can "cheat" if you only want to iterate over a range, by creating a Range class with begin and end methods which mostly just increment an int.
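A minimal sketch of such a Range class (the iterator just wraps an int; a real one for __gnu_parallel::for_each would also need the usual iterator_traits/category boilerplate):
struct Range {
    struct iterator {
        int i;
        int operator*() const { return i; }
        iterator &operator++() { ++i; return *this; }
        bool operator!=(const iterator &o) const { return i != o.i; }
    };
    int first, last;
    iterator begin() const { return iterator{first}; }
    iterator end() const { return iterator{last}; }
};

// for (int i : Range{0, n}) { ... }  iterates i = 0 .. n-1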
Note that for simple loops that do mathematical stuff, the algorithms in #include <numeric> and #include <algorithm> can all be parallelized with G++.
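That G++ feature is libstdc++'s parallel mode; as far as I remember, you enable it by compiling with OpenMP plus the parallel-mode define:
g++ -fopenmp -D_GLIBCXX_PARALLEL your_code.cpp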