clojure terminating parenthesis syntax - clojure

Is there any reason why the expression
(foo5 (foo4 (foo3 (foo2 (foo1 arg)))))
cannot be replaced with
(foo5 (foo4 (foo3 (foo2 (foo1 arg)-)
or the like, and then expanded back?
I know the lack of reader macros means that you cannot change the syntax, but could this expansion possibly be hard-coded into the Java implementation?
I do this when I hand write code.

Yes, you could do this, even without reader macros (in fact, you can change Clojure's syntax with a bit of hacking).
But the question is, what would it gain you? Would it always expand to top level? Then cutting and pasting code would fail if you moved it to or from the top level. And, of course, all the various tools that operate on Clojure syntax would need to understand it.
Ultimately if you really dislike all the close parens why not use
(-> arg foo1 foo2 foo3 foo4)
instead?

Yes, this could be done, but I'm not sure it is the right solution and there are a number of negatives which will likely outweigh the benefits.
Suggestions like this are often the result of poor coding tools and a 'traditional' conceptual model for writing code. Selecting the right tools and looking at your code from a slightly different perspective will usually eliminate the cause which led to this type of suggestion.
Most of the non-functional, non-Lispy languages are based around a token-and-line model of code. You tend to think of the code in terms of lines of tokens and you tend to edit the code on that basis. There is typically less nesting of expressions, and lines are usually terminated with some marker, such as a semicolon. Likewise, tools such as your editor have features which have evolved to support token- and line-based editing. They are good at it.
The Lisp-style languages are less focused on lines of tokens. The emphasis here is on list forms. Lines of tokens are replaced with nested lists of symbols - the line is less relevant and you typically have a lot more nesting of forms. This change means your standard line-oriented tools, like your editor, are less suitable. The typical mental model of the code as lines of tokens is also less useful.
With languages like Clojure, you're better off thinking in terms of list forms and not lines of code. Once you make this transition, you then start looking for tools which also model the code along these lines. For example, you either look for editors specifically designed to work with lists of data rather than lines of data, or you look for editors which have extensions which allow you to work with lists.
Once your editor understands that lists are the fundamental grouping unit, not lines, things like parentheses become largely irrelevant from a code writing/editing perspective. You don't worry about closing parentheses, counting nesting levels, etc. This all gets managed by the editor automatically. You don't move by lines, you move by lists; you don't kill/delete a line, you kill a list; you don't cut and copy a block of lines, you cut and copy a list of lists, etc.
The good news is that in many respects, the structure of these list-based code representations is actually easier to manipulate than that of most line-based languages. This is primarily because there is less ambiguity or complexity. There are fewer exceptions to the rules and the rules are inherently simple. As a consequence, many editors designed for programmers have support for this style of coding, as well as advanced features which are difficult to implement in less structured code.
My suspicion is that your suggestion to add a bit of syntactic sugar to avoid having to type multiple closing parentheses is actually a symptom of not having the right tools to write your code. Once you do, you will almost never need to enter a closing parenthesis or count opening parens to get the nesting right. This will be handled by the editor. Your biggest challenge will be shifting your mental model to think in terms of lists and lists of lists. The parens will become largely invisible and you will jump around your code by list units rather than line units. The change is not easy and it can take some time to re-train your brain and fingers, but once you do, you will likely be surprised at how quickly you begin to edit and manipulate your code.
If you're an Emacs user, I highly recommend extensions such as paredit and lispy. If you're using some other editor, look for paredit-type extensions. However, as these are extensions, you must also spend some time training yourself to use whatever key bindings the extension uses - there is no point having an extension with great list-based code navigation if you still just move around with the arrow keys (unless it is Emacs and you have re-bound those arrow keys to use the paredit navigation bindings).

Related

c++ parser and formatter using a single grammar declaration

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.
A parser generator (e.g. ANTLR) is able to generate the parser from a BNF grammar.
But is there a way to use the same grammar to generate the formatting code?
I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.
Could I use the abstract syntax tree?
boost::spirit? Metaprogramming?
Has anyone tried this?
Thanks.
It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or in algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.
There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.
Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.
In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted, along with any additional newlines and indents which must be inserted to conform with the operators, and advances until there is no pending newline between the fingers.
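To make the idea of pretty-printing operators concrete, here is a deliberately simplified sketch in C++ (the names and structure are mine, not Oppen's). It represents the annotated token stream as nested groups separated by break points and, instead of the linear-time two-finger scan, simply measures each group recursively, breaking a group only when it will not fit in the remaining width:

#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Doc;
using DocPtr = std::shared_ptr<Doc>;

// A "document" is an annotated token stream: plain text, potential break
// points, and recursive groups (Oppen's blocks).
struct Doc {
    enum Kind { Text, Break, Group } kind;
    std::string text;          // payload for Text / Break (a Break prints as a space when not taken)
    int indent = 0;            // extra indentation applied when a Group is broken
    std::vector<DocPtr> kids;  // children of a Group
};

DocPtr text(std::string s) { return std::make_shared<Doc>(Doc{Doc::Text, std::move(s)}); }
DocPtr brk() { return std::make_shared<Doc>(Doc{Doc::Break, " "}); }
DocPtr group(std::vector<DocPtr> kids, int indent = 2) {
    auto d = std::make_shared<Doc>(Doc{Doc::Group});
    d->indent = indent;
    d->kids = std::move(kids);
    return d;
}

// Width of a document if it were printed entirely on one line.
int flatWidth(const DocPtr& d) {
    if (d->kind != Doc::Group) return (int)d->text.size();
    int w = 0;
    for (const auto& k : d->kids) w += flatWidth(k);
    return w;
}

// Print the document, breaking a group only when it does not fit in the
// remaining width. Because the decision is made at the outermost group
// first, line breaks happen "higher up in the parse tree", as described above.
void print(const DocPtr& d, int width, int& col, int indent, std::string& out) {
    if (d->kind == Doc::Text || d->kind == Doc::Break) {
        out += d->text;
        col += (int)d->text.size();
        return;
    }
    bool fits = col + flatWidth(d) <= width;
    int inner = indent + d->indent;
    for (const auto& k : d->kids) {
        if (!fits && k->kind == Doc::Break) {
            out += "\n" + std::string(inner, ' ');  // take the break
            col = inner;
        } else {
            print(k, width, col, inner, out);
        }
    }
}

int main() {
    // Roughly: (f (g 1 2) (h 3 4))
    auto doc = group({ text("(f"), brk(),
                       group({ text("(g"), brk(), text("1"), brk(), text("2)") }),
                       brk(),
                       group({ text("(h"), brk(), text("3"), brk(), text("4)") }),
                       text(")") });
    std::string out;
    int col = 0;
    print(doc, /*width=*/12, col, /*indent=*/0, out);
    std::cout << out << "\n";
}

Printed at a width of 12, the outer group breaks but the two inner groups stay intact on their own indented lines, which is the "prefer to break higher up in the parse tree" behaviour described above.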
The on-line algorithm might not produce optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)
Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.
It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.
Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism are described by M.G.J. van den Brand in Pretty Printing in the ASF+SDF Meta-environment: Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)
One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:
// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;
Or the common case of trailing comments used to document variables:
/* Tracking the current allocation */
int needed_; // Bytes required.
int current_; // Bytes currently allocated.
// int adjustment_; // (TODO) Why is this needed?
/* Either points to the current allocation, or is 0 */
char* buffer_;
In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.
Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.
OP asks, Has anyone tried this?
Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, you get a parser that builds ASTs, and you get a prettyprinter. We've used this on lots of parse/changeAST/unparse tasks for many languages over the last 20 years, preserving the meaning of the source program exactly.
The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.
However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).
DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar, giving the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces (in our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects]
EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essence of the problem is "removed as whitespace"; it goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures the comments is handled in the lexer by declaring comments as "pre" (happening before a token) or "post"; this decision is heuristic on the part of the grammar/lexer engineer, but actually works pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can re-generate them, too.
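To illustrate the mechanism (this is my own toy simplification, not DMS's actual data structures or API), imagine the lexer handing the parser tokens that carry the comments captured around them, each classified as "pre" or "post"; a prettyprinter walking the resulting AST can then regenerate the comments next to whatever construct owns those tokens:

#include <iostream>
#include <string>
#include <vector>

// A token that keeps the comments captured around it instead of losing them.
struct Token {
    std::string text;
    std::vector<std::string> preComments;   // comments declared "pre": belong before the token
    std::vector<std::string> postComments;  // comments declared "post": trail it on the same line
};

int main() {
    // Hand-built stand-in for lexer output over the earlier example:
    //     int needed_;   // Bytes required.
    //     // int adjustment_; // (TODO) Why is this needed?
    //     char* buffer_;
    std::vector<Token> tokens = {
        {"int", {}, {}},
        {"needed_", {}, {}},
        {";", {}, {"// Bytes required."}},  // suffix comment captured as "post"
        {"char", {"// int adjustment_; // (TODO) Why is this needed?"}, {}},  // full-line comment as "pre"
        {"*", {}, {}},
        {"buffer_", {}, {}},
        {";", {}, {}},
    };

    // A prettyprinter walking the AST can then regenerate each comment next
    // to whatever node ended up owning the token it was attached to.
    for (const auto& t : tokens) {
        for (const auto& c : t.preComments) std::cout << "\n" << c << "\n";
        std::cout << t.text << " ";
        for (const auto& c : t.postComments) std::cout << "  " << c << "\n";
    }
    std::cout << "\n";
}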

What is the difference between indentation and pretty printing used in c++?

Indentation and pretty printing are used to improve the clarity and readability of a program. Both these styles use spaces. How can I distinguish between them?
Pretty printing is a method used to make your code easily readable and understandable. Wikipedia explains pretty printing as follows:
Prettyprint (or pretty-print) is the application of any of various
stylistic formatting conventions to text files, such as source code,
markup, and similar kinds of content. These formatting conventions can
adjust positioning and spacing (indent style), add color and contrast
(syntax highlighting), adjust size, and make similar modifications
intended to make the content easier for people to view, read, and
understand. Prettyprinters for programming language source code are
sometimes called code beautifiers or syntax highlighters.
Now let's see what indentation is:
In the written form of many languages, an indentation is an empty
space at the beginning of a line to signal the start of a new
paragraph. Many computer languages have adopted this technique to
designate "paragraphs" or other logical blocks in the program.
In computer programming languages, indentation is used to format
program source code to improve readability. Indentation is generally
only of use to programmers; compilers and interpreters rarely care how
much whitespace is present in between programming statements.
From these, one can understand that indentation is one way of implementing pretty printing.
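As a trivial illustration, both of the following C++ fragments mean exactly the same thing to the compiler; indentation only changes how easily a human can read them, and a full pretty-printer might additionally normalise spacing around operators, wrap long lines, or (in an editor) add syntax colouring:

// Compiles, but hard to scan:
for(int i=0;i<n;++i){if(a[i]<0){a[i]=0;}}

// The same statement after indentation:
for (int i = 0; i < n; ++i) {
    if (a[i] < 0) {
        a[i] = 0;
    }
}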

Automatic refactoring

Please suggest a tool that could automate replacements like:
Mutex staticMutex = Mutex(m_StaticMutex.Handle());
staticMutex.Wait();
to
boost::unique_lock<boost::mutex> lock(m_StaticMutex);
As you see, the arguments must be taken into account. Is there a way simpler than regular expressions?
If you can do this with a modest amount of manual work (even including "search and replace") then this answer isn't relevant.
If the code varies too much (indentation, comments, different variable names) and there are a lot of these, you might need a Program Transformation tool. Such tools tend to operate on program representations such as abstract syntax trees, and consequently are not bothered by layout or whitespace, or even by numbers that are spelled differently because of radix but actually have the same value.
Our DMS Software Reengineering Toolkit is one of these, and has a C++ Front End.
You'd need to give it a rewrite rule something like the following:
domain Cpp; -- tell DMS to use the C++ front end for parsing and prettyprinting
rule replace_mutex(i:IDENTIFIER):statements -> statements
"Mutex \i = Mutex(m_StaticMutex.Handle());
\i.Wait();" =>
"boost::unique_lock<boost::mutex> lock(m_StaticMutex);";
The use of the metavariable \i in both places will ensure that the rule only fires if the name is exactly the same in both places.
It isn't clear to me precisely what you are trying to accomplish; it sort of looks like you want to replace each private mutex with one global one, but I'm not a boost expert. If you tried to do that, I'd expect your program to behave differently.
If those lines appear frequently in your code, similarly formatted, just with different variable names, but not "too" frequently (<200~300 times), I would suggest you use an editor with record-replay capabilities (for example Visual Studio under Windows). Record the steps to replace the two lines with the new one (but keep the variable name). Then repeat "search for Mutex" - "replay macro" as often as you need.
Of course, this specific case should also be solvable for all occurrences at once by any text editor with good "Find and Replace in Files" capabilities.
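For example, in an editor or tool whose search supports multi-line regular expressions with backreferences (the exact flavour varies between editors, so treat this as a sketch rather than a drop-in pattern), something along these lines would match the pair of statements while insisting that the variable name is the same on both lines:

Find:    Mutex (\w+) = Mutex\(m_StaticMutex\.Handle\(\)\);\s*\n\s*\1\.Wait\(\);
Replace: boost::unique_lock<boost::mutex> lock(m_StaticMutex);

The backreference \1 plays the same role as the \i metavariable in the DMS rewrite rule above: the replacement only applies when the two occurrences of the name agree.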

How should I go about building a simple LR parser?

I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are read onto the parse stack, and when a rule is matched then the tokens are shifted or reduced, depending on the parsing table. This continues recursively until all of the tokens are reduced and the parsing is then complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.
In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand writing a table-driven, bottom-up parser like you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the table and such, that's almost completely generic (i.e. it works for any parser). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table -- normally, one row for each parser state, and one column for each possible input. In each cell, you encode the next state to enter when you start in that state and then receive that input. Most of these transitions are basically empty (i.e. they just say that input isn't allowed when you're in that state). (Note: since the valid transitions are so sparse, most parser generators support some way of compressing these tables, but that doesn't change the basic idea.)
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to doing the same though -- I'm pretty sure under UN guidelines, that would be illegal torture.
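To make "a small amount of generic driver code plus tables" concrete, here is a toy sketch in C++. The ACTION/GOTO tables are hand-built for the tiny grammar E -> E '+' 'n' | 'n' (about the largest grammar for which doing that by hand is bearable); in practice the generator emits the tables and you only keep the loop:

#include <iostream>
#include <string>
#include <vector>

// Toy grammar (augmented with E' -> E):
//   rule 1:  E -> E '+' 'n'
//   rule 2:  E -> 'n'
// Terminals, indexed for the table: 'n' = 0, '+' = 1, '$' (end of input) = 2.

enum ActKind { kErr, kShift, kReduce, kAccept };
struct Action { ActKind kind; int arg; };  // arg = target state (shift) or rule number (reduce)

// ACTION[state][terminal], hand-built from the item sets of the grammar above.
static const Action kAction[5][3] = {
    /* state 0 */ {{kShift, 2}, {kErr, 0},    {kErr, 0}},
    /* state 1 */ {{kErr, 0},   {kShift, 3},  {kAccept, 0}},
    /* state 2 */ {{kErr, 0},   {kReduce, 2}, {kReduce, 2}},
    /* state 3 */ {{kShift, 4}, {kErr, 0},    {kErr, 0}},
    /* state 4 */ {{kErr, 0},   {kReduce, 1}, {kReduce, 1}},
};
// GOTO[state] for the only nonterminal E (-1 = impossible).
static const int kGotoE[5] = {1, -1, -1, -1, -1};
// Length of each rule's right-hand side (index 0 unused).
static const int kRhsLen[3] = {0, 3, 1};

int termIndex(char c) { return c == 'n' ? 0 : (c == '+' ? 1 : 2); }

int main() {
    std::string input = "n+n+n$";   // e.g. "1 + 2 + 3" after lexing
    std::vector<int> states = {0};  // the state stack
    size_t pos = 0;
    while (true) {
        Action a = kAction[states.back()][termIndex(input[pos])];
        if (a.kind == kShift) {
            std::cout << "shift '" << input[pos] << "', go to state " << a.arg << "\n";
            states.push_back(a.arg);
            ++pos;
        } else if (a.kind == kReduce) {
            std::cout << "reduce by rule " << a.arg << "\n";
            states.resize(states.size() - kRhsLen[a.arg]);  // pop the handle
            states.push_back(kGotoE[states.back()]);        // then GOTO on E
        } else if (a.kind == kAccept) {
            std::cout << "accept\n";
            break;
        } else {
            std::cout << "syntax error at position " << pos << "\n";
            break;
        }
    }
}

The driver loop never changes; all the knowledge about the grammar lives in the tables, and producing those tables is exactly the tedious, mechanical part the generator exists to do for you.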
The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as gnu calls them - flex/bison.
edit:
Perl has Parse::RecDescent, which is a recursive descent parser, and it may work better for simple jobs like this.
You need to read about ANTLR.
I looked at the definition of your file format. While I am missing some of the context for why you would specifically want an LR parser, my first thought was: why not use existing formats like XML or JSON? Going down the parser-generator route usually has a high startup cost that will not pay off for the simple data you are looking to parse.
As Paul said, lex/yacc are an option; you might also want to have a look at Boost::Spirit.
I have worked with neither; a year ago I wrote a much larger parser using QLALR by the Qt/Nokia people. When I researched parsers, this one, even though very under-documented, had the smallest footprint to get started (only one tool), but it does not support lexical analysis. IIRC I could not figure out C++ support in ANTLR at that time.
10,000-mile view: In general you are looking at two components: a lexer that takes the input symbols and turns them into higher-order tokens, and a grammar that works off those tokens. Your grammar description will state rules, and usually you will include some code with the rules; this code will be executed when the rule is matched. The compiler generator (e.g. yacc) will take your description of the rules and the code and turn it into compilable code. Unless you are doing this by hand, you would not be manipulating the tables yourself.
Well you can't understand it like
"Function A1 does f to object B, then function A2 does g to D etc"
its more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from a data table like thing, like a set of BNF grammar data) telling the function what to do.
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers are complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want to be able to access line number information for different tokens within a block of text and not just for the block of text as a whole. If line numbers for each word are unnecessary, and entire blocks are going to fit into memory, I'd be inclined to model the entire bracketed block as a token, as this may increase processing speed. Either way you'll need a custom yylex function. Start by generating one with lex with fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about what markers to look for from the yacc code.

What are alternatives to regexes for syntax highlighting?

While editing this and that in Vim, I often find that its syntax highlighting (for some filetypes) has some defects. I can't remember any examples at the moment, but someone surely will. Usually, it consists of strings badly highlighted in some cases, some things with arithmetic and boolean operators and a few other small things as well.
Now, vim uses regexes for that kinda stuff (its own flavour).
However, I've started to come across editors which, at first glance, have syntax highlighting better taken care of. I've always thought that regexes are the way to go for that kind of stuff.
So I'm wondering, do those editors just have better-written regexes, or do they take care of it in some other way? What? How is syntax highlighting taken care of when you want it to be "stable"?
And in your opinion, which editor has taken care of it best, and how did it do it (language-wise)?
Edit-1: For example, editors like Emacs, Notepad2, Notepad++, Visual Studio - do you perchance know what mechanism they use for syntax highlighting?
The thought that immediately comes to mind for what you'd want to use instead of regexes for syntax highlighting is parsing. Regexes have a lot of advantages, but as we see with vim's highlighting, there are limits. (If you look for threads about using regexes to analyze XML, you'll find extensive material on why regexes can't do what parsers do.)
Since what we want from syntax highlighting is for it to follow the syntactic structure of the language, which regexes can only approximate, you need to perform some level of real parsing to go beyond what regexes can do. A simple recursive descent parser will probably do great for most languages, I'm thinking.
Some programming languages have a formal definition/specification written in Backus-Naur Form. All*) programming languages can be described in it. All you then need, is some kind of parser for the notation.
*) not verified
For instance, C's BNF definition is "only five pages long".
If you want accurate highlighting, you need real programming, not regular expressions. Regexes are rarely the answer for anything but trivial tasks. To do highlighting in a better way you need to write a simple parser. Parsers basically have separate components that can each do something like identify and consume a quoted string or number literal. If a component, when looking at its given cursor position, can't consume what's underneath, it does nothing. From that you can parse or highlight fairly simply and easily.
Given something like
static int field = 123;
• The first matcher would skip the whitespace before "static". The keyword, literal, etc. matchers would do nothing, because handling whitespace is not their thing.
• The keyword matcher, when positioned over "static", would consume that. Because "s" is not a digit, the literal matcher does nothing. The whitespace skipper does nothing as well, because "s" is not a whitespace character.
Naturally your loop continues to advance the cursor over the input string until the end is reached. The ordering of your matchers is of course important.
This approach is both flexible in that it handles syntactically incorrect fragments and is also easy to extend and reuse individual matchers to support highlighting of other languages...
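Here is a minimal sketch of that loop in C++ (the matcher names and token classes are my own invention). Each matcher inspects the current cursor position and either consumes some characters or declines; anything nobody recognises simply passes through one character at a time, which is what keeps the approach tolerant of incomplete or syntactically incorrect fragments:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Span { std::string kind; std::string text; };

static const std::vector<std::string> kKeywords = {"static", "int", "return"};

// Each matcher returns how many characters it can consume at `pos` (0 = not mine).
size_t matchWhitespace(const std::string& s, size_t pos) {
    size_t i = pos;
    while (i < s.size() && std::isspace((unsigned char)s[i])) ++i;
    return i - pos;
}
size_t matchNumber(const std::string& s, size_t pos) {
    size_t i = pos;
    while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;
    return i - pos;
}
size_t matchWord(const std::string& s, size_t pos) {
    size_t i = pos;
    while (i < s.size() && (std::isalnum((unsigned char)s[i]) || s[i] == '_')) ++i;
    return i - pos;
}

int main() {
    std::string src = "static int field = 123;";
    std::vector<Span> spans;
    size_t pos = 0;
    while (pos < src.size()) {
        size_t n;
        // Note the ordering: the number matcher runs before the word matcher.
        if ((n = matchWhitespace(src, pos)) > 0) {
            spans.push_back({"space", src.substr(pos, n)});
        } else if ((n = matchNumber(src, pos)) > 0) {
            spans.push_back({"number", src.substr(pos, n)});
        } else if ((n = matchWord(src, pos)) > 0) {
            std::string word = src.substr(pos, n);
            bool isKeyword = false;
            for (const auto& k : kKeywords) if (k == word) isKeyword = true;
            spans.push_back({isKeyword ? "keyword" : "identifier", word});
        } else {
            // Punctuation and anything unrecognised passes through untouched,
            // so syntactically incorrect fragments don't derail the highlighter.
            n = 1;
            spans.push_back({"other", src.substr(pos, 1)});
        }
        pos += n;
    }
    for (const auto& sp : spans)
        std::cout << sp.kind << ": \"" << sp.text << "\"\n";
}

To support another language you would mostly swap in a different keyword list and add or reorder matchers (a string matcher, a comment matcher, and so on), which is the reuse the answer above has in mind.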
I suggest the use of REs for syntax highlighting. If it's not working properly, then your RE isn't powerful or complicated enough :-) This is one of those areas where REs shine.
But given that you couldn't supply any examples of failure (so we can tell you what the problem is) or the names of the editors that do it better (so we can tell you how they do it), there's not a lot more we'll be able to give you in an answer.
I've never had any trouble with Vim with the mainstream languages and I've never had a need to use weird esoteric languages, so it suits my purposes fine.