empty rule in ocamlyacc - ocaml

I have the following lexer rules:
let ws = [' ' '\t' '\n']+
...
| ws {Printf.printf "%s" (Lexing.lexeme lexbuf); WS(Lexing.lexeme lexbuf)}
And the following parser rules:
%token <string> WORD WS
cs : LSQRB wsornon choices wsornon RSQRB {$2}
;
wsornon : /* nothing */
| WS {$1}
;
choices : choice {$1}
| choices choice {$2}
;
choice : CHOICE LCURLYB mainbody RCURLYB {$3}
;
I basically want to get wsornon to match with whitespace or nothing. But cs gives syntax errors for the case without whitespace (which corresponds to the empty rule).
Am I missing something?

Even an alternative that matches the empty stream needs a semantic action, so give both alternatives one:
wsornon:
| { something for nothing }
| WS { something for whitespace }
Note that menhir has a parameterized rule, option, that suits this kind of thing, so you don't have to write a separate rule for it. option(foo) yields a value of type bar option when rule foo yields a value of type bar; since you are going to ignore the whitespace anyway, your situation is slightly different.
If you want to ignore whitespace, why not drop it altogether at the lexer step? Is it useful somewhere else in your grammar? I'd rather hack the lexer a bit to emit a whitespace token only after the tokens where I know it matters than have whitespace pollute my whole grammar. Of course, menhir lets you define parameterized rules that could help with that (example below untested):
ws(rule):
| list(WS) result = rule list(WS) { result }
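As a minimal sketch of the "drop it at the lexer step" idea, the wrapper below filters WS tokens out of a token stream before the parser sees them. The token type and the list-backed token source are stand-ins invented for this illustration, not the code ocamlyacc generates.

```ocaml
(* Hypothetical token type standing in for the one the parser would define. *)
type token = LSQRB | RSQRB | WORD of string | WS of string | EOF

(* Wrap a token-producing function so that WS tokens never reach the parser. *)
let rec skip_ws (next : unit -> token) () : token =
  match next () with
  | WS _ -> skip_ws next ()
  | tok -> tok

(* Toy token source backed by a list, used here instead of a real lexbuf. *)
let of_list (toks : token list) : unit -> token =
  let rest = ref toks in
  fun () ->
    match !rest with
    | [] -> EOF
    | t :: ts -> rest := ts; t

(* Collect tokens until EOF so the result can be inspected. *)
let drain (next : unit -> token) : token list =
  let rec go acc =
    match next () with
    | EOF -> List.rev acc
    | t -> go (t :: acc)
  in
  go []

let filtered =
  drain (skip_ws (of_list [LSQRB; WS " "; WORD "a"; WS "\n"; RSQRB]))
```

With a real ocamllex lexer the same effect is usually obtained inside the lexer itself, by having the whitespace rule call the lexing rule recursively (for a rule assumed to be named token, that is | ws { token lexbuf }) instead of returning a WS token.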

VIM, Automatic formatting, Code-Guidelines, C++

I want to be able to automatically format code for the following rules using vim:
Rule 1): Continuation lines of if expressions must be indented with 3 spaces. Example:
if(a &&
   b)
(Note: b is indented three spaces relative to the parent if; current vim behavior is 4.)
Rule 2): parameters separated by space. Example:
function_call(a, b, c);
Rule 3): No space between assignment operators. Example:
int a=x;
Rule 4): Reference/dereference operator is attached to variable name not type. Example:
int &x = b;
Where possible, I want vim to do this automatically as I am typing. If that is not possible, identifying formatting that violates the above rules (by marking it as an error) would also be helpful.
You can set auto-indentation rules in a custom indent file. Check out examples in the "indent" directory, somewhere like /usr/share/vim/vim74/indent, or in the Vim source code distribution.
You can set error highlighting rules in a custom syntax file. Find examples in the "syntax" directory, somewhere like /usr/share/vim/vim74/syntax, or again in the Vim source code distribution. Here's an example for JSON files:
" Syntax: Decimals smaller than one should begin with 0 (so .1 should be 0.1).
syn match jsonNumError "\:\#<=[[:blank:]\r\n]*\zs\.\d\+"
If you want to actually re-format code automatically as you go you might need a special plugin like vim-autoformat and/or an external tool like ClangFormat.
Regarding indenting, and so on, check the options :h 'sw', :h 'cindent', :h 'cinoptions'...
Regarding where spaces and newlines shall be inserted:
For code already typed, clang-format is indeed the best way to reformat it. There is a plugin for vim.
For snippets, brackets and so on, I've lately been working on a plugin aimed at formatting text inserted by other plugins. Excessively inspired, I've named the core plugin lh-style. It's used by mu-template (my snippet/templating plugin) and lh-brackets.
For other stuff you'll want to reformat on the fly, it'll be a little more complex. Maybe lh-style could help; I don't know, I haven't given the subject much thought yet.
For instance, outside comments and strings, = shall be expanded into:
itself after a [ (lambdas),
<BS>=<space> after =, >, <, ! followed by a space,
<space>=<space> otherwise.
EDIT: I got it all wrong, it does exactly the contrary of what you're looking for.
It'd be something like:
" ftplugin/c/mymappings.vim
function! s:InsertExpr(char) abort
  let col = col('.')
  let line = getline('.')
  " Leave the key untouched inside comments, strings and similar groups
  let syn = synIDattr(synID(line('.'),col-1,1),'name')
  if syn =~? 'comment\|string\|character\|doxygen'
    return a:char
  endif
  let lcut = getline('.')[: col-2]
  let before =
        \ lcut =~ '[=<>!] $' ? "\<bs>"
        \ : lcut =~ "[=<>![ \t\n]$" ? ''
        \ : ' '
  let after = line[col-1] =~ "[ \t\n\\]]" ? '' : ' '
  return before.a:char.after
endfunction
inoremap <buffer> <expr> = <sid>InsertExpr('=')
inoremap <buffer> <expr> < <sid>InsertExpr('<')
inoremap <buffer> <expr> > <sid>InsertExpr('>')

Ocaml record matching

Given a basic record
type t = {a:string;b:string;c:string}
why does this code compile
let f t = match t with
{a;b;_} -> a
but this
let f t = match t with
{_;b;c} -> b
and
let f t = match t with
{a;_;c} -> c
does not? I'm asking out of curiosity, hence the obviously useless code examples.
The optional _ field must be the last field. This is documented as a language extension in Section 7.2.
Here's the production for reference:
pattern ::= ...
∣ '{' field ['=' pattern] { ';' field ['=' pattern] } [';' '_' ] [';'] '}'
Because the latter two examples are syntactically incorrect. The syntax allows you to terminate the list of field patterns with an underscore to tell the compiler that you're aware there are more fields than you are trying to match. It is used to suppress a warning (that is disabled by default). Here is what the OCaml manual says about it:
Optionally, a record pattern can be terminated by ; _ to convey the fact that not all fields of the record type are listed in the record pattern and that it is intentional. By default, the compiler ignores the ; _ annotation. If warning 9 is turned on, the compiler will warn when a record pattern fails to list all fields of the corresponding record type and is not terminated by ; _. Continuing the point example above,
If you want to match to a name without binding it to a variable, then you should use the following syntax:
{a=_; b; c}
E.g.,
let {a=_; b; c} = {a="hello"; c="cruel"; b="world"};;
val b : string = "world"
val c : string = "cruel"
To add to the answers by Jeffrey Scofield and ivg, what the erroneous examples are trying to achieve can in fact be achieved by using a different order of fields. Like so:
let f t = match t with
{b;c;_} -> b
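The three forms discussed in the answers can be put side by side in a short sketch. The function names first, second and third and the value r are made up for this illustration.

```ocaml
type t = { a : string; b : string; c : string }

(* Trailing "; _": the remaining fields are left out on purpose. *)
let first { a; _ } = a

(* "field = _" skips a named field explicitly, wherever it appears. *)
let second { a = _; b; c = _ } = b

(* Fields may be listed in any order, so the wildcard can always go last. *)
let third { c; b = _; _ } = c

let r = { a = "hello"; b = "world"; c = "cruel" }
```

Only the bare _ is tied to the final position; a pattern of the form field = _ is an ordinary field pattern and can appear anywhere.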

Unexpected behaviour when parsing a string with optional Suffix in antlr4

I want to match multiple functions that accept a comma-separated list of placeholders followed by the definition of a unit, which is again separated by a comma from the rest of the arguments. The text to parse would look like example 1: "produkt([F1],[F2],EURO_CENT)" or example 2: "produkt([F1],[F2],EURO)"
The grammar for this, as I would expect it to work, is this:
[...]
term: [...]
| 'produkt(' placeholder ',' placeholder ',' UNIT ')' #MultUnit
[...]
| placeholder #PlaceholderTwo
;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
LBRACK: '[';
RBRACK: ']';
PLACE: TEXT+ NUMBER?;
placeholder: LBRACK PLACE+ RBRACK;
[..]
UNIT: TEXT (('_' TEXT)*)?;
TEXT: ('a' .. 'z' | 'A' .. 'Z')+;//[a-zA-Z]+;
[...]
With this grammar example 1 works as expected, but example 2 gives me the error "line 1:18 mismatched input 'EURO' expecting UNIT". As I understand it, this means that "EURO" itself does not match the pattern for UNIT but "EURO_CENT" does. I do not understand why this is the case, because the pattern for UNIT says that the "_CENT" part is optional and only the first part is mandatory.
I also tried to give the UNIT a prefix (in this case "Unit.") by changing the pattern for UNIT to UNIT: 'Unit.' TEXT ('_' TEXT)*;
I changed the input string to "produkt([F1],[F2],Unit.EURO)" accordingly and this matches like a charm.
However, the second approach is not very user-friendly, since we have to add something (in our opinion) unnecessary to the input. So the question is: why does the first option not match as expected when the UNIT string is a single word, and is there a workaround for it?
The short answer is that PLACE and UNIT are mutually ambiguous for content that only matches TEXT: both match EURO with the same length, and the lexer then picks the rule listed first, PLACE. If the sample inputs are canonical, then change the PLACE rule to remove the ambiguity:
PLACE : TEXT+ NUMBER ;
Other possibilities include redefining PLACE as
PLACE : LBRACK TEXT+ NUMBER? RBRACK; // adjust other rules accordingly
adding a predicate to the rule:
PLACE : {followsLBRACK()}? TEXT+ NUMBER ;
and redefining UNIT:
UNIT: TEXT ( 'S' | ( '_' TEXT )+ ) ; // EUROS or EURO_CENT; similar for other units.
BTW, ANTLR generally evaluates its grammars top-down, so mixing parser and lexer rules as you have actually obfuscates the logic.
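The ambiguity can be reproduced outside ANTLR with a small simulation of the lexer's choice: longest match wins, and on a tie the rule listed first wins. This is a sketch only. The hand-written matchers approximate PLACE and UNIT from the question, and NUMBER is assumed to be one or more digits because its definition is not shown.

```ocaml
let is_alpha ch = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
let is_digit ch = ch >= '0' && ch <= '9'

(* Length of the longest run of characters satisfying [p], starting at [i]. *)
let run p s i =
  let n = String.length s in
  let rec go j = if j < n && p s.[j] then go (j + 1) else j in
  go i - i

(* PLACE : TEXT+ NUMBER? ;  (NUMBER assumed to be digits) *)
let match_place s =
  let letters = run is_alpha s 0 in
  if letters = 0 then 0 else letters + run is_digit s letters

(* UNIT : TEXT ('_' TEXT)* ; *)
let match_unit s =
  let n = String.length s in
  let rec more pos =
    if pos < n && s.[pos] = '_' then
      let letters = run is_alpha s (pos + 1) in
      if letters = 0 then pos else more (pos + 1 + letters)
    else pos
  in
  let letters = run is_alpha s 0 in
  if letters = 0 then 0 else more letters

(* Token type chosen for the text at the start of [s]:
   longest match wins, and a tie goes to PLACE because it is listed first. *)
let classify s =
  let p = match_place s and u = match_unit s in
  if p = 0 && u = 0 then "no match"
  else if p >= u then "PLACE"
  else "UNIT"
```

Here classify "EURO_CENT" yields "UNIT" because UNIT matches more characters, while classify "EURO" yields "PLACE" because both rules match four characters and PLACE comes first. That is why the parser reports a mismatch when it expects UNIT.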

In Boost Spirit Qi, how do I match every character up to the next whitespace (with pre-skip)

Within a boost::spirit::qi grammar rule, how do you match a string of characters up to and excluding the next whitespace character, as defined by the supplied skipper?
For example, if the grammar is a set of attributes defined as:
attributeList = '(' >> *attribute >> ')';
attribute = (name : value) | (name : value units);
How do I match any character for name up to and excluding the first skipper character?
For example, for name, I would like to pre-skip, then match all characters except ':' or a skipper character. Do I have to instantiate a skipper within the grammar class, so that I can do something like:
name = +qi::char_ !(skipper | ':');
or can I access the existing supplied skipper object somehow and reference it directly? Also, I don't believe this needs to be wrapped in qi::lexeme[]...
Thanks in advance for correcting the error of my ways
In order to do this, you'll need to suppress skipping, so qi::lexeme will have to be involved (or at least qi::no_skip, but you'd only use it to reimplement qi::lexeme), and to do precisely what you write you'll also need the skip parser. Then you could write
qi::lexeme[ +(qi::char_ - ':' - skipper) ]
The requirements seem rather lax, though. It is unusual to allow even non-printable characters such as the bell sign (ASCII 7) in identifiers. I don't know what exactly you're trying to do, so I can't answer such design questions for you, but to me it seems like there's a good chance you'd be happier with a more standard rule such as
qi::lexeme[ qi::alpha >> *qi::alnum ]
(for a very simple example. Your mileage may vary wrt underscores etc.)

ANTLR3 String Literals and Disallowing Nested Comments

I've recently been tasked with writing an ANTLR3 grammar for a fictional language. Everything else seems fine, but I've a couple of minor issues which I could do with some help with:
1) Comments are between '/*' and '*/', and may not be nested. I know how to implement comments themselves ('/*' .* '*/'), but how would I go about disallowing their nesting?
2) String literals are defined as any sequence of characters (except for double quotes and new lines) in between a pair of double quotes. They can only be used in an output statement. I attempted to define this thus:
output : OUTPUT (STRINGLIT | IDENT) ;
STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
For some reason, however, the parser accepts
OUTPUT "Hello,
World!"
and tokenises it as "Hello, \nWorld. Where the exclamation mark or closing " went I have no idea. Something to do with whitespace maybe?
WHITESPACE : ( '\t' | ' ' | '\n' | '\r' | '\f' )+ { $channel = HIDDEN; } ;
Any advice would be much appreciated - thanks for your time! :)
The form you wrote already disallows nested comments. The token will stop at the first instance of */, even if multiple /* sequences appeared in the comment. To allow nested comments you have to write a lexer rule to specifically treat the nesting.
The problem here is STRINGLIT does not allow a string to be split across multiple lines. Without seeing the rest of your lexer rules, I cannot tell you how this will be tokenized, but it's clear from the STRINGLIT rule you gave that the sample input is not a valid string.
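Both points can be checked with a hand-written sketch that mimics the two token rules. The helper names comment_end, comment_token and is_stringlit are invented for this illustration, and the code only approximates what the generated ANTLR lexer does.

```ocaml
(* Index just past the first "*/" at or after [i], or None if unterminated. *)
let rec comment_end s i =
  if i + 1 >= String.length s then None
  else if s.[i] = '*' && s.[i + 1] = '/' then Some (i + 2)
  else comment_end s (i + 1)

(* Text of the comment token at the start of [s], assuming [s] begins with
   the opening delimiter. The search starts after it, at index 2. *)
let comment_token s =
  match comment_end s 2 with
  | Some stop -> Some (String.sub s 0 stop)
  | None -> None

(* STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
   true when the whole of [s] is exactly one such literal. *)
let is_stringlit s =
  let n = String.length s in
  let rec body_ok i =
    i >= n - 1
    || (match s.[i] with
        | '\r' | '\n' | '"' -> false
        | _ -> body_ok (i + 1))
  in
  n >= 2 && s.[0] = '"' && s.[n - 1] = '"' && body_ok 1
```

For an input with a second opening delimiter inside the comment, comment_token still stops at the first closing delimiter, so the remainder is lexed as ordinary tokens rather than as part of a nested comment. is_stringlit rejects any literal that contains a line break, which matches the observation that the two-line sample is not a valid STRINGLIT.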
NOTE: Your input given in the original question was not clear, so I reformatted it in an attempt to show the exact input you were using. Can you verify that my edit properly represents the input?