Tweaking clang-format for C++20 ranges pipelines

C++20 ranges (and C++23, with std::ranges::to<T>()) make it idiomatic to use operator| to build a pipeline of transformations such as this:
return numbers
    | std::views::filter([](int n) { return n % 2 == 0; })
    | std::views::transform([](int n) { return n * 2; })
    | std::ranges::to<std::vector>();
With my project's current .clang-format, that looks something like
return numbers | std::views::filter([](int n) { return n % 2 == 0; }) |
       std::views::transform([](int n) { return n * 2; }) | std::ranges::to<std::vector>();
which I find pretty hard to read. If I set BreakBeforeBinaryOperators: All, I get
return numbers | std::views::filter([](int n) { return n % 2 == 0; })
       | std::views::transform([](int n) { return n * 2; }) | std::ranges::to<std::vector>();
which is better, but I'd really like the original version with one pipeline operation on each line.
I can adjust the column limit, but that is a major change and also starts to line-break my lambdas, which I don't like:
return numbers | std::views::filter([](int n) {
                     return n % 2 == 0;
                 })
       | std::views::transform(
             [](int n) { return n * 2; })
       | std::ranges::to<std::vector>();
I can manually use empty comments to force a newline:
return numbers //
       | std::views::filter([](int n) { return n % 2 == 0; }) //
       | std::views::transform([](int n) { return n * 2; }) //
       | std::ranges::to<std::vector>();
but again, that is not ideal, knowing that pipelines will be pretty common. Am I missing settings? Or is this more of a feature request I should direct to clang-format, like "Add an option so that when more than n operator|s appear in an expression, each subexpression is put on its own line"?

There's a feature request for AllowBreakingBinaryOperators. Until that feature lands, the only option is to compromise:
As you've said, you can use // comments to force line breaks.
Alternatively, you can use // clang-format off/on to disable clang-format locally and format the pipeline yourself.
Here's a more complex solution that combines both:
auto function() {
    return numbers | std::views::filter([](int n) { return n % 2 == 0; })
           | std::views::transform([](int n) { return n * 2; })
           | std::views::take(3) | std::ranges::to<std::vector>();
}
First, use // comments to split the expression where you want breaks, then run clang-format:
auto function() {
    return numbers
        //
        | std::views::filter([](int n) { return n % 2 == 0; })
        | std::views::transform([](int n) { return n * 2; })
        | std::views::take(3)
        //
        | std::ranges::to<std::vector>();
}
Next, remove the // comments and wrap the statement in // clang-format off/on so clang-format leaves it alone:
auto function() {
    // clang-format off
    return numbers
        | std::views::filter([](int n) { return n % 2 == 0; })
        | std::views::transform([](int n) { return n * 2; })
        | std::views::take(3)
        | std::ranges::to<std::vector>();
    // clang-format on
}
As for matrix-like layouts, the option AlignArrayOfStructures might help.
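For illustration, a minimal sketch of its effect, assuming clang-format 13 or later (where the option was added): with AlignArrayOfStructures: Right, the literal columns in an array-of-structs initializer are kept aligned:

struct Cell { int x, y, weight; };
Cell grid[] = {
    { 1,  2,  30},
    {10, 20, 300},
};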

Related

How does this code work? The result doesn't make sense to me and doesn't appear in the debugger [closed]

#include <iostream>
using namespace std;

int f(int n, int m) {
    if (n == 1)
        return 0;
    else
        return f(n - 1, m) + m;
}

int main()
{
    cout << f(3874, 1000);
    cout << endl;
    return 0;
}
The result is 3873000. Why is m multiplied by n-1, and how does the function work in detail?
The else block is executed at all levels of recursion, except the deepest one.
The number of levels in the recursion tree is n, so the else block is executed n-1 times.
This else block first makes the recursive call, then adds m to the result it gets back, and returns that sum to its caller, which does the same, and so on, until the walk back up the recursion tree is complete.
The original caller will thus see a base number (0) to which m was repeatedly added, exactly n-1 times.
So the function calculates m(n-1), provided that n is greater than 0. If not, the recursion will run into a stack overflow error.
Visualisation
To visualise this, let's split the second return statement into two parts, where first the result of the recursive call is stored in a variable, and then the sum is returned. Also, let's take a small value for n, like 3.
So this is then the code:
int f(int n, int m) {
    if (n == 1)
        return 0;
    else {
        int result = f(n - 1, m);
        return result + m;
    }
}

int main()
{
    cout << f(3, 10);
}
We can imagine each function execution (starting with main) as a box (a frame), in which local variables live their lives. Each recursive call creates a new box, and when return is executed that box vanishes again.
So we can imagine the above code to execute like this:
+-[main]------------------------------+
|  f(3, 10) ...                       |
|  +-[f]---------------------------+  |
|  |  n = 3, m = 10                |  |
|  |  f(3-1, 10) ...               |  |
|  |  +-[f]---------------------+  |  |
|  |  |  n = 2, m = 10          |  |  |
|  |  |  f(2-1, 10) ...         |  |  |
|  |  |  +-[f]---------------+  |  |  |
|  |  |  |  n = 1, m = 10    |  |  |  |
|  |  |  |  return 0         |  |  |  |
|  |  |  +-------------------+  |  |  |
|  |  |  result = 0             |  |  |
|  |  |  return 0 + 10          |  |  |
|  |  +-------------------------+  |  |
|  |  result = 10                  |  |
|  |  return 10 + 10               |  |
|  +-------------------------------+  |
|  cout << 20                         |
+-------------------------------------+
I hope this clarifies it.
The algorithm solves the recurrence
F(n) = F(n-1) + m
with
F(1) = 0.
(I removed m as an argument, as its value is constant).
We have
F(n) = F(n-1) + m = F(n-2) + 2m = F(n-3) + 3m = ... = F(1) + (n-1)m.
As written elsewhere, the recursion depth is n, which is dangerous.
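Since the recursion only repeats an addition n-1 times, an iterative rewrite avoids the dangerous call depth entirely. This is a sketch added for illustration, not code from the question:

int f(int n, int m) {
    int sum = 0;
    for (int i = 1; i < n; i++)  // runs n-1 times, like the else branch
        sum += m;
    return sum;                  // equals m * (n - 1) for n >= 1
}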

Get rid of nested for loops with std::ranges

Say I have this code:
for (auto& a : x.as)
{
    for (auto& b : a.bs)
    {
        for (auto& c : b.cs)
        {
            for (auto& d : c.ds)
            {
                if (d.e == ..)
                {
                    return ...
                }
            }
        }
    }
}
as, bs, cs, ds are std::vectors of the corresponding element types.
Is it possible with std::ranges to convert these four ugly loops into a beautiful one-line expression?
With join and transform views, you might do:
for (auto& e : x.as | std::views::transform(&A::bs) | std::views::join
                    | std::views::transform(&B::cs) | std::views::join
                    | std::views::transform(&C::ds) | std::views::join
                    | std::views::transform(&D::e))
{
    // ...
}
Demo
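For completeness, here is a self-contained sketch of definitions under which the pipeline above compiles; the struct and member names mirror the question, everything else is assumed for illustration:

#include <ranges>
#include <vector>

struct D { int e; };
struct C { std::vector<D> ds; };
struct B { std::vector<C> cs; };
struct A { std::vector<B> bs; };
struct X { std::vector<A> as; };

// Given X x;, the loop above then visits every d.e as an int&.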

ANTLR - How to extract units from a dimension

I'm using ANTLR4 and the CSS grammar from https://github.com/antlr/grammars-v4/tree/master/css3. The grammar defines the following (pared down a little for brevity):
dimension
    : ( Plus | Minus )? Dimension
    ;

fragment FontRelative
    : Number E M
    | Number E X
    | Number C H
    | Number R E M
    ;

fragment AbsLength
    : Number P X
    | Number C M
    | Number M M
    | Number I N
    | Number P T
    | Number P C
    | Number Q
    ;

fragment Angle
    : Number D E G
    | Number R A D
    | Number G R A D
    | Number T U R N
    ;

fragment Length
    : AbsLength
    | FontRelative
    ;

Dimension
    : Length
    | Angle
    ;
The matching works fine but I don't see an obvious way to extract the units. The parser creates a DimensionContext which has 3 TerminalNode members - Dimension, Plus and Minus. I'd like to be able to extract the unit during parse without having to do additional string parsing.
I know that one issue is that Length and Angle are fragments. I changed the grammar to not use fragments:
Unit
    : 'em'
    | 'ex'
    | 'ch'
    | 'rem'
    | 'vw'
    | 'vh'
    | 'vmin'
    | 'vmax'
    | 'px'
    | 'cm'
    | 'mm'
    | 'in'
    | 'pt'
    | 'q'
    | 'deg'
    | 'rad'
    | 'grad'
    | 'turn'
    | 'ms'
    | 's'
    | 'hz'
    | 'khz'
    ;

Dimension : Number Unit;
And things still parse but I don't get any more context about what the units are - the Dimension is still a single TerminalNode. Is there a way to deal with this without having to pull apart the full token string?
You will want to do as little as possible in the lexer:
NUMBER
    : Dash? Dot Digit+ { atNumber(); }
    | Dash? Digit+ ( Dot Digit* )? { atNumber(); }
    ;

UNIT
    : { aftNumber() }?
      ( 'px'  | 'cm'  | 'mm'   | 'in'
      | 'pt'  | 'pc'  | 'em'   | 'ex'
      | 'deg' | 'rad' | 'grad' | '%'
      | 'ms'  | 's'   | 'hz'   | 'khz'
      )
    ;
The trick is to produce the NUMBER and UNIT as separate tokens, yet limited to the required ordering. The actions in the NUMBER rule just set a flag and the UNIT predicate ensures that a UNIT can only follow a NUMBER:
protected void atNumber() {
    _number = true;
}

protected boolean aftNumber() {
    if (_number && Character.isWhitespace(_input.LA(1))) return false;
    if (!_number) return false;
    _number = false;
    return true;
}
The parser rule is trivial, but preserves the detail required:
number
    : NUMBER UNIT?
    ;
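For example (an illustrative input, not from the question): given -1.5px, the lexer emits NUMBER("-1.5"), whose action sets the flag, and since the 'p' follows the number immediately, aftNumber() succeeds and UNIT("px") is emitted. The number parser rule then exposes the value and the unit as two separate child nodes instead of one opaque Dimension token.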
Use a tree-walk; parse the NUMBER to a Double, and use an enum (or equivalent) to provide the semantic UNIT characterization:
public enum Unit {
    CM("cm", true, true),       // 1cm = 96px/2.54
    MM("mm", true, true),
    IN("in", true, true),       // 1in = 2.54cm = 96px
    PX("px", true, true),       // 1px = 1/96th
    PT("pt", true, true),       // 1pt = 1/72th
    EM("em", false, true),      // element font size
    REM("rem", false, true),    // root element font size
    EX("ex", true, true),       // element font x-height
    CAP("cap", true, true),     // element font nominal capital letters height
    PER("%", false, true),
    DEG("deg", true, false),
    RAD("rad", true, false),
    GRAD("grad", true, false),
    MS("ms", true, false),
    S("s", true, false),
    HZ("hz", true, false),
    KHZ("khz", true, false),
    NONE(Strings.EMPTY, true, false),       // 'no unit specified'
    INVALID(Strings.UNKNOWN, true, false);

    public final String symbol;
    public final boolean abs;
    public final boolean len;

    private Unit(String symbol, boolean abs, boolean len) {
        this.symbol = symbol;
        this.abs = abs;
        this.len = len;
    }

    public boolean isAbsolute() { return abs; }

    public boolean isLengthUnit() { return len; }

    // call from the visitor to resolve from `UNIT` to Unit
    public static Unit find(TerminalNode node) {
        if (node == null) return NONE;
        for (Unit unit : values()) {
            if (unit.symbol.equalsIgnoreCase(node.getText())) return unit;
        }
        return INVALID;
    }

    @Override
    public String toString() {
        return symbol;
    }
}
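For example, a hypothetical visitor fragment showing the enum in use (the parser class and context names below are illustrative; they depend on your grammar's name and the generated code):

public class CssVisitor extends CssBaseVisitor<Object> {
    @Override
    public Object visitNumber(CssParser.NumberContext ctx) {
        // NUMBER is always present; UNIT may be absent (Unit.find returns NONE)
        double value = Double.parseDouble(ctx.NUMBER().getText());
        Unit unit = Unit.find(ctx.UNIT());
        // ... use value and unit ...
        return null;
    }
}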

range-v3: Joining piped ranges with a delimiter

I'm trying to build a basic demo of the range-v3 library: take some integers, filter out odd values, stringify them, then join those into a comma-separated list. For example, { 8, 6, 7, 5, 3, 0, 9 } becomes "8, 6, 0". From reading the docs and going through examples, it seems like the naïve solution would resemble:
string demo(const vector<int>& v)
{
    return v |
           ranges::view::filter([](int i) { return i % 2 == 0; }) |
           ranges::view::transform([](int i) { return to_string(i); }) |
           ranges::view::join(", ");
}
but building on Clang 7 fails with a static assertion: "Cannot get a view of a temporary container". Since I'm collecting the result into a string, I can use the eager version, action::join, instead:
string demo(const vector<int>& v)
{
    return v |
           ranges::view::filter([](int i) { return i % 2 == 0; }) |
           ranges::view::transform([](int i) { return to_string(i); }) |
           ranges::action::join;
}
but the eager version doesn't seem to have an overload that takes a delimiter.
Interestingly, the original assertion goes away if you collect join's inputs into a container first. The following compiles and runs fine:
string demo(const vector<int>& v)
{
    vector<string> strings = v |
        ranges::view::filter([](int i) { return i % 2 == 0; }) |
        ranges::view::transform([](int i) { return to_string(i); });
    return strings | ranges::view::join(", ");
}
but this totally defeats the principle of lazy evaluation that drives so much of the library.
Why is the first example failing? If it's not feasible, can action::join be given a delimiter?
action::join should accept a delimiter. Feel free to file a feature request. The actions need a lot of love.
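In the meantime, a minimal eager workaround (my sketch, not from the answer): build the string with an explicit loop over the filtered view, adding the delimiter only between elements:

string demo(const vector<int>& v)
{
    string out;
    bool first = true;
    for (int i : v | ranges::view::filter([](int n) { return n % 2 == 0; })) {
        if (!first) out += ", ";  // delimiter between elements, not after the last
        out += to_string(i);
        first = false;
    }
    return out;
}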

Efficient parallelisation of a linear algebraic function in C++ OpenMP

I have little experience with parallel programming and was wondering if anyone could have a quick glance at a bit of code I've written and see if there are any obvious ways I can improve the efficiency of the computation.
The difficulty arises from the fact that I have multiple matrix operations of unequal dimensionality to compute, so I'm not sure of the most condensed way of coding the computation.
Below is my code. Note this code DOES work. The matrices I am working with are of dimension approx 700x700 [see int s below] or 700x30 [int n].
Also, I am using the Armadillo library for my sequential code. It may be the case that parallelizing with OpenMP while retaining the Armadillo matrix classes is slower than defaulting to the standard library; does anyone have an opinion on this (before I spend hours overhauling!)?
double start, end, dif;
int i, j, k; // iteration counters
int s, n;    // matrix dimensions

mat B; B.load(...location of stored s*n matrix...); // input objects loaded from file
mat I; I.load(...s*s matrix...);
mat R; R.load(...s*n matrix...);
mat D; D.load(...n*n matrix...);

double e = 0.1; // scalar parameter
s = B.n_rows; n = B.n_cols;

mat dBdt; dBdt.zeros(s, n); // object for storing output of function

// 100x sequential computation using Armadillo linear algebraic functionality
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {
    dBdt = B % (R - (I * B)) + (B * D) - (B * e);
}
end = omp_get_wtime();
dif = end - start;
cout << "Seq computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
// 100x parallel computation using OpenMP
omp_set_num_threads(8);
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {
    // parallel computation of I * B
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < s; k++) {
                dBdt(i, j) += I(i, k) * B(k, j);
            }
        }
    }
    // parallel computation of B % (R - (I * B))
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            dBdt(i, j) = R(i, j) - dBdt(i, j);
            dBdt(i, j) *= B(i, j);
            dBdt(i, j) -= B(i, j) * e;
        }
    }
    // parallel computation of B * D
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                dBdt(i, j) += B(i, k) * D(k, j);
            }
        }
    }
}
end = omp_get_wtime();
dif = end - start;
cout << "OMP computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
If I run on 4 hyper-threaded cores, I get the following output:
Seq computation: 5.54926e-10
relaxation time = 0.130031
OMP computation: 5.54926e-10
relaxation time = 2.611040
This suggests that although both methods produce the same result, the parallel formulation is roughly 20 times slower than the sequential one.
It is possible that for matrices of this size, the overheads involved in this 'variable-dimension' problem outweigh the benefits of parallelizing. Any insights would be much appreciated.
Thanks in advance,
Jack
If you use a compiler which corrects your bad loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimizations. As recommended by others, you should consider an optimized library such as MKL or ACML. The default gfortran BLAS typically provided with distros is not multithreaded.
The art of HPC is efficiency (poor grants never get HPC cluster quota),
so the first hope is that your process never re-reads from file.
Why? This would be an HPC-killer:
I need to repeat this computation many thousands of times
Fair to say, this comment increases the need to completely review the approach and to re-design the future solution so that it does not rely on a few tricks, but truly gains from your case-specific arrangement.
Last but not least: [PARALLEL] scheduling is not needed here, as "just"-[CONCURRENT] process scheduling is quite enough. There is no need to orchestrate any explicit inter-process synchronisation or message-passing, and the process can simply be arranged for the best performance possible.
No "...quick glance at a bit of code..." will help
You need to first understand both your whole process and the hardware resources it will be executed on.
The CPU type will tell you the available instruction-set extensions for advanced tricks; the L3 / L2 / L1 cache sizes and cache-line sizes will help you decide on the best cache-friendly re-use of cheap data access (not paying hundreds of [ns] where one can operate smarter, in just a few [ns], on a not-yet-evicted NUMA-core-local copy).
The Maths first, implementation next:
As given:
dBdt = B % ( R - ( I * B ) ) + ( B * D ) - ( B * e )
On a closer look, the HPC/cache-alignment priorities and wrong-looping traps become apparent:
dBdt = B % ( R - ( I * B ) )   ELEMENT-WISE OP, B[s,n] COLUMN-WISE
     +     ( B * D )           SUM-PRODUCT  OP, B[s,n] ROW-WISE times D[n,n] COLUMNS
     -     ( B * e )           ELEMENT-WISE OP, B[s,n] ROW-WISE times a SCALAR

[ shapes involved: dBdt[s,n] = B[s,n] % ( R[s,n] - I[s,s] . B[s,n] ) + B[s,n] . D[n,n] - B[s,n] * e ]
Having this in mind, efficient HPC loops will look much different.
Depending on the real CPU caches, the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) in a single phase, together with the element-wise, longest-pipeline part B % ( R - ( I * B ) ), which has a chance to re-use roughly 1000 x ( n - 1 ) cache hits on column-aligned B data. That ought to fit well within L1 data-cache footprints, so savings on the order of seconds can come just from cache-aligned loops.
Only after this cache-friendly loop alignment is finished may distributed processing help, not before.
So, an experimentation plan setup:
Step 0: the ground truth: ~0.13 [s] for dBdt[700,30] using Armadillo, over 100 test loops.
Step 1: the manual-serial version: test the rewards of the best cache-aligned code (not the posted one, but a math-equivalent, cache-line-re-use-optimised one; there ought to be no more than four for(){...} code blocks, two nested, with the remaining two inside, to meet the linear-algebra rules without devastating the benefits of cache-line alignment; a sketch follows after this list). There is some residual potential to benefit a bit more in [PTIME] from a duplicated [PSPACE] data layout (both a FORTRAN-order and a C-order copy, for the respective re-reading strategies), as the matrices are miniature in size and the L2 / L1 data caches available per CPU core have grown well in scale.
Step 2: the manual-omp( <= NUMA_cores - 1 ) version: test whether omp can indeed yield any "positive" Amdahl's-law speedup (beyond the omp setup overhead costs). A careful process-to-CPU_core affinity mapping may help: reserve a configuration-defined set of ( NUMA_cores - 1 ) cores for the HPC process and affinity-map all other (non-HPC) processes onto the last (shared) CPU core, thus helping the HPC cores keep their cache lines un-evicted by any kernel/scheduler-injected non-HPC thread.
(As seen in Step 2, there are arrangements, derived from HPC best practices, that no compiler, even a magic-wand-equipped one, would ever be able to implement, so do not hesitate to ask your PhD tutor for a helping hand if your thesis needs some HPC expertise; this is not easy to build by trial and error in such an expensive experimental domain, when your primary domain is neither linear algebra nor CS-theoretic / HW-specific cache-strategy optimisation.)
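For concreteness, a sketch of the fused serial loop that Step 1 refers to (my illustration, using the question's Armadillo element access; it demonstrates the four-loop, two-nested structure, not the final cache-optimal ordering):

for (int i = 0; i < s; i++) {
    for (int j = 0; j < n; j++) {
        double ib = 0.0;                 // (I * B)(i, j)
        for (int k = 0; k < s; k++)
            ib += I(i, k) * B(k, j);
        double bd = 0.0;                 // (B * D)(i, j)
        for (int k = 0; k < n; k++)
            bd += B(i, k) * D(k, j);
        // one element of dBdt = B % (R - I*B) + B*D - B*e per pass
        dBdt(i, j) = B(i, j) * (R(i, j) - ib) + bd - B(i, j) * e;
    }
}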
Epilogue:
Using smart tools in an inappropriate way brings nothing more than additional overheads (task splits/joins plus memory translations; worse with atomic locking, worst with blocking / fences / barriers).