Boost::spirit illegal_backtracking exception - c++

I use Boost.Spirit.Lex and .Qi for a simple calculator project and (as usual) it gives me some pain to debug and use. The debug prints:
<expression>
<try>boost::spirit::multi_pass::illegal_backtracking
This exception is thrown and I can't understand why. I use macros in my code and it would be a pain to extract a minimal example, so I'm giving the whole project. Just run "make" at the root, then launch ./sash; a prompt will appear. To test, enter "-echo $5-8".
Googling doesn't seem to turn up any similar problems with this exception...
The parser is in arithmetic/, and the call of the parser is at the end of arithmetic/evaluator.cpp
Any help is greatly appreciated.

Your code is breaking because BOOST_SPIRIT_QI_DEBUG as well as the on_error<> handler seem to use iterators after they might have been invalidated.
To be honest, I'm not completely sure how this could happen.
Background
AFAICT lexertl uses spirit::multi_pass<> with a split_functor input policy and a split_std_deque storage policy [1].
Now, (luckily?) the checking policy is buf_id_check which means that the iterator will check for invalidation at the time of dereference.
Iterators are expected to be invalidated if
the iterator is dereferenced, growing the buffer to >16 tokens and the iterator is the only one referring to the shared state
or somewhere along the line clear_queue is called explicitly (e.g. from the flush_multi_pass primitive in the Spirit Repository)
Honestly, I don't see either of these two conditions being met. A quick and dirty
token_iterator_type clone = iter; // just to make it non-unique...
in evaluator.cpp doesn't make a difference (ruling out reason #1)
Temporarily disabling the docheck implementation in buf_id_check_policy made valgrind point out that on_error<> and BOOST_SPIRIT_DEBUG* are causing invalid memory references. Commenting out both indeed makes all problems go away (and eval_expression now works).
However, this is likely not your preferred solution.
Proposed solution
Since
you're working on a fixed, in-memory container representing the input, so you don't really need multi_pass behaviour emulation
you're using a trivial grammar, so you don't really benefit from lexertl, while you do get a lot of added complexity (as you can see)
I've quickly refactored some code: https://github.com/sehe/sash-refactor/commits/master
commit dec31496 sanity - lets do without macros
4 files changed, 59 insertions(+), 146 deletions(-)
commit 6056574c dead code, excess scope, excess instantiation
5 files changed, 38 insertions(+), 62 deletions(-)
commit 99d441db remove lexer
9 files changed, 25 insertions(+), 177 deletions(-)
Now you will find that your code is generally much simpler and much shorter, doesn't run into multi_pass limits, and you can still have SPIRIT_DEBUG as well as on_error handling :) In the end
binary size in -g3 is reduced from 16Mb to 6.5Mb
a net 263 lines of code have been removed
more importantly, it works
Here's some samples (without debug output):
$ ./sash <<< '-echo $8-9'
-1
Warning: Empty environment variable "8-9".
$ ./sash <<< '-echo $8\*9'
72
Warning: Empty environment variable "8*9".
$ ./sash <<< '-echo $8\*(9-1)'
64
Warning: Empty environment variable "8*(9-1)".
$ ./sash <<< '-echo $--+-+8\*(9-1)'
-64
Warning: Empty environment variable "--+-+8*(9-1)".
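For a feel of what "remove lexer" means in practice, here's a minimal sketch (my own illustration, not the actual refactored sash code): a plain Qi calculator grammar that parses arithmetic directly from string iterators, with no lexer and no multi_pass:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

int main()
{
    std::string const input = "8*(9-1)";
    std::string::const_iterator first = input.begin(), last = input.end();

    // The usual calculator layering: expr -> term -> factor
    qi::rule<std::string::const_iterator, int(), qi::space_type> expr, term, factor;
    expr   = term[qi::_val = qi::_1]
             >> *('+' >> term[qi::_val += qi::_1] | '-' >> term[qi::_val -= qi::_1]);
    term   = factor[qi::_val = qi::_1]
             >> *('*' >> factor[qi::_val *= qi::_1] | '/' >> factor[qi::_val /= qi::_1]);
    factor = qi::int_[qi::_val = qi::_1]
           | '(' >> expr[qi::_val = qi::_1] >> ')'
           | '-' >> factor[qi::_val = -qi::_1]
           | '+' >> factor[qi::_val = qi::_1];

    int result = 0;
    if (qi::phrase_parse(first, last, expr, qi::space, result) && first == last)
        std::cout << result << "\n"; // prints 64
}

Since std::string iterators are plain random-access iterators, buf_id_check and clear_queue simply never enter the picture.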
[1] Which, despite its name, buffers previously seen tokens in a std::vector<>

Related

Stata: store pathname with forbidden characters in variable

I am working with institution A's code, which contains among other things,
adopath ++ $prog
use "$prog/subDirectory/otherFile.do"
The actual do-file is something I try not to change, as I know that in its current state it works for institution A.
I have to define my own profile.do to make it work, and in it I need to set up $prog. I cannot see how institution A set up their profile.do, or whether it contains forbidden characters. $prog is supposed to contain the working directory, which in my case is
global prog "C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog"
As I learned in another question, spaces are forbidden characters, which is why this setup definitely will not work. It was suggested that I use double quotes,
global prog ""C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog""
This worked to some extent: adopath ++ $prog now runs smoothly. However, the second command, use "$prog/subDirectory/otherFile.do", now produces an error. So here is my question:
First best: is there a different way of defining $prog in a way that allows me to run the remainder of the code without getting errors?
Second best: is there a safe way to rewrite use "$prog/subDirectory/otherFile.do"? That is, if I rewrite it as use $prog"/subDirectory/otherFile.do" and it works on my system, am I guaranteed that it will work wherever the old code used to work? Can I safely exchange that piece of code while guaranteeing continued functionality?
The second best:
profile.do
global prog "C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog"
Institution A's code
capture adopath ++ $prog
if (_rc != 0) adopath ++ "$prog"
use "$prog/subDirectory/otherFile.do"
Being really picky, you would substitute == for != and the expected error code for 0; for example, 198 (invalid syntax).
This solution respects the original code and, if necessary, will adequately handle the error produced by your profile.do file.
(Again, this wouldn't be a problem if your working directory had no blanks.)

In clang-format, what do the penalties do?

The clang-format style options documentation includes a number of options called PenaltyXXX. The documentation doesn't explain how these penalties should be used. Can you describe how to use these penalty values and what effect they achieve (perhaps with an example)?
When you have a line that's over the line length limit, clang-format will need to insert one or more breaks somewhere. You can think of penalties as a way of discouraging certain line-breaking behavior. For instance, say you have:
Namespaces::Are::Pervasive::SomeReallyVerySuperDuperLongFunctionName(args);
// and the column limit is here: ^
clang-format will probably format this to look a little strange:
Namespaces::Are::Pervasive::SomeReallyVerySuperDuperLongFunctionName(
args);
You might decide that you're willing to violate the line length by a character or two for cases like this, so you could steer that by setting the PenaltyExcessCharacter to a low number and PenaltyBreakBeforeFirstCallParameter to a higher number.
Personally, I really dislike when the return type is on its own line, so I set PenaltyReturnTypeOnItsOwnLine to an absurdly large number.
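For instance, a hypothetical .clang-format fragment along those lines (the option names are real clang-format options, but the values are purely illustrative):

PenaltyExcessCharacter: 4                  # cheap: run a character or two over the limit
PenaltyBreakBeforeFirstCallParameter: 200  # expensive: breaking right after the '('
PenaltyReturnTypeOnItsOwnLine: 1000        # taste: practically never give the return type its own line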
As an aside, this system was inherited from LaTeX, which allows you to specify all kinds of penalties for line-breaking, pagination, and hyphenation.
Can you describe how to use these penalty values and what effect they achieve (perhaps with an example)?
You can see an example in this Git 2.15 (Q4 2017) clang-format for the Git project written in C:
See commit 42efde4 (29 Sep 2017) by Johannes Schindelin (dscho).
(Merged by Johannes Schindelin -- dscho -- in commit 42efde4, 01 Oct 2017)
You can see the old and new values in that commit. To illustrate those values:
clang-format: adjust line break penalties
We really, really, really want to limit the columns to 80 per line: One
of the few consistent style comments on the Git mailing list is that the
lines should not have more than 80 columns/line (even if 79 columns/line
would make more sense, given that the code is frequently viewed as diff,
and diffs adding an extra character).
The penalty of 5 for excess characters is way too low to guarantee that,
though, as pointed out by Brandon Williams.
(See this thread)
From the existing clang-format examples and documentation, it appears
that 100 is a penalty deemed appropriate for Stuff You Really Don't Want, so let's assign that as the penalty for "excess characters", i.e.
overly long lines.
While at it, adjust the penalties further: we are actually not that keen
on preventing new line breaks within comments or string literals, so the
penalty of 100 seems awfully high.
Likewise, we are not all that adamant about keeping line breaks away
from assignment operators (a lot of Git's code breaks immediately after
the = character just to keep that 80 columns/line limit).
We do frown a little bit more about functions' return types being on
their own line than the penalty 0 would suggest, so this was adjusted,
too.
Finally, we do not particularly fancy breaking before the first parameter
in a call, but if it keeps the line shorter than 80 columns/line, that's
what we do, so lower the penalty for breaking before a call's first
parameter, but not quite as much as introducing new line breaks to
comments.
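Putting the numbers from that message together, the shape of the change is roughly this (the "was" values are quoted in the message; the new values are illustrative where the message doesn't state them exactly):

PenaltyExcessCharacter: 100               # was 5: overly long lines now cost a lot
PenaltyBreakComment: 10                   # was 100: breaking comments is acceptable
PenaltyBreakString: 10                    # was 100: breaking string literals is acceptable
PenaltyReturnTypeOnItsOwnLine: 5          # was 0: frown a little bit more on this
PenaltyBreakBeforeFirstCallParameter: 70  # lowered, but kept above the comment/string penalties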

ifort dialect options for very old code

I have suddenly been given a few very old Fortran codes to compile and get working for my research group. When I compile the code with ifort, I get the following error: error #6526: A branch to a do-term-shared-stmt has occurred from outside the range of the corresponding inner-shared-do-construct.
Here is the bit of code that seems to be at fault:
...
IF(THRU.GT.0.D0) GO TO 120 00011900
L1=LL 00012000
A1=AA 00012100
DR1=DRR 00012200
RMAX1=RMAXX 00012300
RMIN1=RMINN 00012400
IF(DR1.EQ.0.D0) DR1=DRP 00012500
KMAX1=(RMAX1-RMIN1)/DR1+1.D-08 00012600
IF(KMAX1.GT.NN .OR. KMAX1.LE.0) KMAX1=NN 00012700
RINT1=RMAX1 00012800
IF(RMAX1.NE.0.D0) 00012900
2RMAX1=DR1*DFLOAT(KMAX1)+RMIN1 00013000
IUP=KMAX1 00013100
R=RMIN1 00013200
DO 120 K1=1,KMAX1 00013300
R=R+DR1 00013400
XX(K1)=R 00013500
120 CONTINUE 00013600
WRITE(IW,940) NAM1,IOPT1,L1,A1,DR1,RMAX1,RMIN1,ANU1,BNU1,CNU1 00013700
121 CONTINUE 00013800
...
Being quite unfamiliar with Fortran, I have done some searching, and what I can see is that the compiler does not like the IF statement branching to the terminal statement of the DO loop. Also, it seems that some older dialects or compilers supported this.
My question is this: is there an option that will allow ifort to compile this successfully? I.e., is there a specific dialect-compatibility option for ifort that will make this legal?
What are the side effects of that particular code on a compiler that would accept it? Is it possible that, aside from the WRITE statement, the side effects are identical to going to label 121? Or was the DO loop perhaps supposed to go to 121?
I would consider modifying the code if not for the fact that my advisor told me not to make any changes whatsoever without consulting him first, so I am asking this question to see whether I need to consult him. That said, if my only option is modifying the code, suggestions for that would be welcome, so that I have an idea of what needs to be done when I go to my advisor.
I'll note that the code in question isn't valid even as Fortran 77, but people did weird stuff back in the mists of time. I'll defer to any historian or person with experience. In particular, one should be very careful to understand any subtleties in the code in relation to my "answer" here.
If we assume that the intention of the given goto is to jump to code after the execution of the loop, then I can answer with a minimal change to the code.
Stripping the line-number comments for clarity, then
DO 120 K1=1,KMAX1
R=R+DR1
XX(K1)=R
120 CONTINUE
WRITE(IW,940) NAM1,IOPT1,L1,A1,DR1,RMAX1,RMIN1,ANU1,BNU1,CNU1
can be replaced by
DO K1=1,KMAX1
R=R+DR1
XX(K1)=R
END DO
120 WRITE(IW,940) NAM1,IOPT1,L1,A1,DR1,RMAX1,RMIN1,ANU1,BNU1,CNU1
I would be quite surprised if the author of the code intended the loop to continue to 121: no variable referenced by the WRITE statement is updated in the loop. It is possible that the goto was intended to reference 121, but I do see that many variables are not updated in the section that would be skipped.

How do c/c++ compilers know which line an error is on

There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think to do this is to tokenize the input string into a 2D array. This would store [lines][tokens].
C/C++ could be tokenized into 1 long 1D array which would probably be more efficient. I am wondering what the usual parsing method would be that would keep line information.
Actually, most of this is covered in the Dragon Book.
Compilers do lexing and parsing, i.e. transform the source code into a tree representation.
When doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of a failure might get lost, and the information might be off.
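As a rough sketch of the idea (a hypothetical token type and a toy whitespace-splitting scanner, not taken from any real compiler), position tracking during scanning might look like this:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Each token carries the position where it started.
struct Token {
    std::string text;
    int line;   // 1-based line number
    int column; // 1-based column of the token's first character
};

std::vector<Token> scan(const std::string& src)
{
    std::vector<Token> tokens;
    int line = 1, column = 1;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\n') { ++line; column = 1; ++i; continue; }
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++column; ++i; continue; }
        Token tok{std::string(), line, column};
        while (i < src.size() && !std::isspace(static_cast<unsigned char>(src[i]))) {
            tok.text += src[i++];
            ++column;
        }
        tokens.push_back(tok);
    }
    return tokens;
}

int main()
{
    // Later phases see positions through the tokens, so an error involving
    // "oops" can be reported as 2:3 even though scanning finished long ago.
    for (const Token& t : scan("int x =\n  oops ;"))
        std::cout << t.line << ':' << t.column << ' ' << t.text << '\n';
}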
This is the first step on the long, complicated path towards "Engineering a Compiler" or compiler theory.
The short answer to that is: there's a module called "front-end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed, so each compiler will have its own set of modules, but more or less the steps involved in front-end processing are:
Scanning - maps character streams into words or tokens (also ignoring whitespace/comments)
Parsing - this is where syntax and (some) semantic analysis take place and where syntax errors are reported
To sum this up for you: the compiler knows the location of your error because when something doesn't fit into the structure called an "abstract syntax tree" (i.e. the tree cannot be constructed), or doesn't follow any of the syntax-directed translation rules, there's something wrong, and the compiler indicates the location where the match failed. If the grammar error is on just one word/token, then even a precise column location can be returned, since nothing matched a terminal: a basic token like the if keyword in the C/C++ language.
If you want to know more about this topic, my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front end like Clang.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit, expanded headers and all, rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank the requirements. This would work for C or C++...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc (probably -ffunction-sections) to compile so that each function becomes an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning if you are interested in the idea.)
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different change sets) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side note: this basic strategy could work for any language which targets machine code, as well as for VM instruction sets like Java VM bytecode, .NET CLR code, etc.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for a function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }, as in the sketch below.
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for a parser generator.
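To make that concrete, here is a deliberately dumb sketch (toy input; it assumes balanced braces, no preprocessing, and no non-syntactic macros): it treats the identifier seen just before a '(' at brace depth 0 as a candidate function name.

#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string const src =
        "int add(int a, int b) {\n"
        "    return a + b;\n"
        "}\n"
        "static void log_it(void) {\n"
        "    /* ... */\n"
        "}\n";

    int braces = 0;
    std::string ident, name;
    for (char c : src) {
        if (braces == 0) {
            if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
                ident += c;
                continue;
            }
            if (c == '(' && !ident.empty())
                name = ident;        // candidate function name
            ident.clear();
        }
        if (c == '{') {
            if (braces++ == 0)       // entering a file-scope body
                std::cout << "function: " << name << '\n';
        } else if (c == '}') {
            --braces;
        }
    }
}

Strings, comments, and file-scope control constructs would fool it, which is exactly the kind of miss the question says it can tolerate.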
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be published student projects which have made a reasonable stab at a parser.