Trying to skip spaces in JavaCC - regex

I am writing a lexical analyzer and parser using the Java Compiler Compiler, and I have a problem with SKIP not working on spaces. Newlines, tabs, and comments are skipped just fine, but spaces are not. I know that SKIP doesn't work inside tokens, but I don't understand why I am only having this problem with spaces and not with anything else I have tried to skip.
This is what my SKIP specification looks like:
SKIP: {
< " " | "\t" | "\r" | "\n" | "\r\n" > //White space
| <"//" (~["\n","\r"])* ("\n"|"\r"|"\r\n") > //Single-line comments
| <"/*"(~["/"])* "*""/" > //Multi-line comments
}
Then, later on, I have various tokens. Below are a few of them as examples.
<DEFAULT> TOKEN : {
< COMMAND : "list" > : IN_LIST_COMMAND
| < ID : (["A"-"Z","a"-"z"])+(["A"-"Z","a"-"z","0"-"9"])* >
| < NUMBER : (["0"-"9"])+ >
...
These work just as I would like, except for when there is a space directly after one of them. For instance, if I tried to give
list [
to the parser, it would give me a TokenMgrError because "list" has a space following it.
Note: I have read through every scrap of documentation on JavaCC I can find, both from the main JavaCC website and from other sources. I have also searched for similar questions on StackOverflow and other sites, and found nothing that answers my question.

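The problem is that a SKIP production without a lexical state list only applies in the DEFAULT state. After the keyword "list" the lexer switches to the IN_LIST_COMMAND state, where no SKIP rule is active, so the space cannot be matched. From the JavaCC documentation of the JavaCC Grammar File, we see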
There is a standard lexical state called "DEFAULT". If the lexical
state list is omitted, the regular expression production applies to
the lexical state "DEFAULT".
The fix is to specify that the production applies in all lexical states. This is done by writing
<*> SKIP: {
< " " | "\t" | "\r" | "\n" | "\r\n" >
| <"//" (~["\n","\r"])* ("\n"|"\r"|"\r\n") >
| <"/*"(~["/"])* "*""/" >
}
which means that the regular expression production applies in all states.
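To see why the state list matters, here is a toy model of lexical states in Python. It is only a sketch, not JavaCC: it takes the first matching rule instead of the longest match, and the LBRACKET token in IN_LIST_COMMAND is a hypothetical stand-in for whatever tokens the real grammar defines in that state.

```python
import re

def make_rules(skip_states):
    """Each rule: (states it applies in, kind, regex, next state or None)."""
    return [
        (skip_states, "SKIP", re.compile(r"[ \t\r\n]"), None),
        ({"DEFAULT"}, "COMMAND", re.compile(r"list"), "IN_LIST_COMMAND"),
        # Hypothetical token, assumed to exist only in IN_LIST_COMMAND.
        ({"IN_LIST_COMMAND"}, "LBRACKET", re.compile(r"\["), "DEFAULT"),
    ]

def tokenize(text, rules):
    state, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        for states, kind, pattern, next_state in rules:
            if state not in states:
                continue  # rule is not active in the current lexical state
            m = pattern.match(text, pos)
            if m:
                if kind != "SKIP":
                    tokens.append((kind, m.group()))
                pos = m.end()
                state = next_state or state
                break
        else:
            raise ValueError(f"lexical error at {pos} in state {state}")
    return tokens

# Skipping in every state plays the role of <*> SKIP.
ALL_STATES = {"DEFAULT", "IN_LIST_COMMAND"}
print(tokenize("list [", make_rules(ALL_STATES)))
```

Calling tokenize("list [", make_rules({"DEFAULT"})) instead raises the lexical error, because the space is seen while the lexer is in IN_LIST_COMMAND and no active rule matches it.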

VIM, Automatic formatting, Code-Guidelines, C++

I want to be able to automatically format code for the following rules using vim:
Rule 1): Continuation lines of if expressions must be indented with 3 spaces. Example:
if(a &&
   b)
(Note: b is indented three spaces relative to the parent if; the current vim behavior is 4.)
Rule 2): parameters separated by space. Example:
function_call(a, b, c);
Rule 3): No spaces around assignment operators. Example:
int a=x;
Rule 4): Reference/dereference operator is attached to variable name not type. Example:
int &x = b;
Where possible, I want vim to do this automatically as I am typing. If this is not possible, identifying formatting that violates the above rules (by marking it as an error) would also be helpful.
You can set auto-indentation rules in a custom indent file. Check out examples in the "indent" directory, somewhere like /usr/share/vim/vim74/indent, or in the Vim source code distribution.
You can set error highlighting rules in a custom syntax file. Find examples in the "syntax" directory, somewhere like /usr/share/vim/vim74/syntax, or again in the Vim source code distribution. Here's an example for JSON files:
" Syntax: Decimals smaller than one should begin with 0 (so .1 should be 0.1).
syn match jsonNumError "\:\@<=[[:blank:]\r\n]*\zs\.\d\+"
If you want to actually re-format code automatically as you go you might need a special plugin like vim-autoformat and/or an external tool like ClangFormat.
Regarding indenting, and so on, check the options :h 'sw', :h 'cindent', :h 'cinoptions'...
Regarding where spaces and newlines shall be inserted,
For code already typed, clang-format is indeed the best way to go to reformat code. There is a plugin for vim.
For snippets, brackets and so on, lately I've worked on a plugin aimed at formatting text inserted by other plugins. Excessively inspired, I've named the core plugin lh-style. It's used by mu-template (my snippet/templating plugin), and lh-brackets.
For other stuff you'll want to reformat on the fly, it'll be a little bit more complex. Maybe lh-style could help, I don't know, I haven't given much thought to the subject yet.
For instance, outside comments and strings, = shall be expanded into:
itself after a [ (lambdas),
<BS>=<space> after =, >, <, or ! followed by a space,
<space>=<space> otherwise.
EDIT: I got it all wrong, it does exactly the contrary of what you're looking for.
It'd be something like:
" ftplugin/c/mymappings.vim
function! s:InsertExpr(char) abort
  let col = col('.')
  let line = getline('.')
  " Leave the key untouched inside comments and string/character literals.
  let syn = synIDattr(synID(line('.'),col-1,1),'name')
  if syn =~? 'comment\|string\|character\|doxygen'
    return a:char
  endif
  " The text left of the cursor decides what goes before the operator.
  let lcut = getline('.')[: col-2]
  let before =
        \ lcut =~ '[=<>!] $' ? "\<bs>"
        \ : lcut =~ "[=<>![ \t\n]$" ? ''
        \ : ' '
  " Add a trailing space unless one (or a closing bracket) already follows.
  let after = line[col-1] =~ "[ \t\n\\]]" ? '' : ' '
  return before.a:char.after
endfunction
inoremap <buffer> <expr> = <sid>InsertExpr('=')
inoremap <buffer> <expr> < <sid>InsertExpr('<')
inoremap <buffer> <expr> > <sid>InsertExpr('>')

Vim - regex for changing bool variable checking

I am working on a C project and I want to change all bool-variable checking from
if(!a)
to
if(a == false)
in order to make the code easier to read (I want to do the same with while statements).
I'm using the following regex, which searches for an exclamation mark followed by a lowercase character, and then for the last closing parenthesis on the line.
%s/\(.*\)!\([a-z]\)\(.*\))\([^)]+\)/\1\2\3 == false)\4/g
I'm sorry for asking you to look over it, but I can't understand why it fails.
Also, is there an easier way of solving this problem and of using vim regex in general?
One solution should be this one:
%s/\(.*\)(\(\s*\)!\(\w\+\))/\1(\3 == false)/gc
Here, we do the following:
%s/\(.*\)(\(\s*\)!\(\w\+\))/\1(\3 == false)/gc
   \--+-/|\--+--/|\---+--/|
      |  |   |   |    |   finally test for a single `)`.
      |  |   |   |    (\3): then for one or more word characters (the var name).
      |  |   |   the single `!`
      |  |   (\2): then for any amount of white space before the `!`
      |  the single `(`
      (\1): test for any characters before a single `(`
The match is then replaced by the first group, an opening parenthesis, the third group, the text == false, and a closing parenthesis.
To do this in vim, you could use the following:
%s/\(if(\)!\([^)]\+\)/\1\2==false/c
This makes sure that only if(!var) constructs are matched; you can change if to while for the next task.
The c flag asks for confirmation at every occurrence.
As @Kent said, this is not a small undertaking. However, for the simple case of just if(!a), it can be done.
:%s/\<if(\zs!\(\k\+\)\ze)/\1 == false/c
Explanation:
Start by making sure if is at a word boundary with \<. This ensures it isn't part of some function name.
\zs and \ze set the start and end of the match respectively.
Capture the variable via the keyword class \k (\w works too), ending up with \(\k\+\)
For extra safety, use the c flag to confirm each substitution.
Thoughts:
This will need to be updated for other constructs, e.g. while
May need to make alterations for extra white-space, e.g. \<if\s*(\s*\zs!\(\k\+\)\ze\s*)
May want to use [a-z0-9_] instead of \k or \w to avoid capturing macros
There are instances where you may not have a construct: foo = !a && b;
This only handles the false cases. Doing a == true may be far trickier
Depending on your case it might be safest to just do the following:
:%s/!\([a-z0-9]\+\)/\1 == false/gc
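The same substitution can be sketched outside Vim, for instance in Python, which makes it easy to try against sample lines before touching the code base. This is a minimal sketch, assuming the negated operand is a plain identifier; the names NEGATED_CHECK and rewrite_negations are made up for illustration.

```python
import re

# \b keeps "if"/"while" from matching inside longer identifiers.
# The operand must be a bare identifier followed by ")", so cases such as
# !foo(x) or !a && b are left alone.
NEGATED_CHECK = re.compile(r"\b(if|while)(\s*)\(\s*!\s*([A-Za-z_]\w*)\s*\)")

def rewrite_negations(source: str) -> str:
    """Rewrite if(!a) / while(!a) into if(a == false) / while(a == false)."""
    return NEGATED_CHECK.sub(r"\1\2(\3 == false)", source)

print(rewrite_negations("if(!a) x();"))
print(rewrite_negations("while (!done) {"))
print(rewrite_negations("if(!a && b) y();"))  # not a simple case, left as-is
```

Like the Vim patterns above, it cannot tell a bool from a pointer or an integer, so every change still needs a review.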
On top of the answers already presented, I would say that the code does not smell like it needs refactoring. For a global regex replacement, the primary problem is to find
all bool-variables
and distinguish them from pointers, etc.

ANTLR3 String Literals and Disallowing Nested Comments

I've recently been tasked with writing an ANTLR3 grammar for a fictional language. Everything else seems fine, but I've a couple of minor issues which I could do with some help with:
1) Comments are between '/*' and '*/', and may not be nested. I know how to implement comments themselves ('/*' .* '*/'), but how would I go about disallowing their nesting?
2) String literals are defined as any sequence of characters (except for double quotes and new lines) in between a pair of double quotes. They can only be used in an output statement. I attempted to define this thus:
output : OUTPUT (STRINGLIT | IDENT) ;
STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
For some reason, however, the parser accepts
OUTPUT "Hello,
World!"
and tokenises it as "Hello, \nWorld. Where the exclamation mark or closing " went I have no idea. Something to do with whitespace maybe?
WHITESPACE : ( '\t' | ' ' | '\n' | '\r' | '\f' )+ { $channel = HIDDEN; } ;
Any advice would be much appreciated - thanks for your time! :)
1) The form you wrote already disallows nested comments. The token will stop at the first instance of */, even if multiple /* sequences appeared in the comment. To allow nested comments you have to write a lexer rule to specifically treat the nesting.
2) The problem here is that STRINGLIT does not allow a string to be split across multiple lines. Without seeing the rest of your lexer rules, I cannot tell you how this will be tokenized, but it's clear from the STRINGLIT rule you gave that the sample input is not a valid string.
NOTE: Your input given in the original question was not clear, so I reformatted it in an attempt to show the exact input you were using. Can you verify that my edit properly represents the input?
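Both points can be checked with ordinary regular expressions. The Python sketch below is not ANTLR; it only mirrors the character classes of the STRINGLIT rule and the stop-at-first-*/ behavior described above, and the sample strings are made up for illustration.

```python
import re

# Same exclusions as the STRINGLIT rule: no CR, LF, or double quote inside.
STRINGLIT = re.compile(r'"[^\r\n"]*"')
# Non-greedy body: the match ends at the first "*/".
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

single_line = 'OUTPUT "Hello, World!"'
multi_line = 'OUTPUT "Hello,\nWorld!"'
nested = "/* outer /* inner */ still outer? */"

print(STRINGLIT.search(single_line).group())  # the whole quoted string
print(STRINGLIT.search(multi_line))           # None: the newline breaks the literal
print(COMMENT.match(nested).group())          # stops after "inner */"
```

The text left over after the first */ in the nested case would then be handed to the other lexer rules, which is why nesting is effectively disallowed.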

empty rule in ocamlyacc

I have the following lexer rules:
let ws = [' ' '\t' '\n']+
...
| ws {Printf.printf "%s" (Lexing.lexeme lexbuf); WS(Lexing.lexeme lexbuf)}
And the following parser rules:
%token <string> WORD WS
cs : LSQRB wsornon choices wsornon RSQRB {$2}
;
wsornon : /* nothing */
| WS {$1}
;
choices : choice {$1}
| choices choice {$2}
;
choice : CHOICE LCURLYB mainbody RCURLYB {$3}
;
I basically want to get wsornon to match with whitespace or nothing. But cs gives syntax errors for the case without whitespace (which corresponds to the empty rule).
Am I missing something?
Even when a production matches the empty stream, it still needs a semantic action:
wsornon:
| { something for nothing }
| WS { something for whitespace }
Note that menhir has an option parameterized rule that is just fine for this kind of thing, so you don't have to write another rule for it. In fact, option(foo) returns a value of type bar option if rule foo returns something of type bar, whereas you're going to ignore the whitespace anyway, so that's a slightly different situation.
If you want to ignore whitespace, why don't you drop it altogether at the lexer step? Is it useful somewhere else in your grammar? I'd rather hack the lexer a bit to have some whitespace token just after some tokens where I know they're important than have them pollute my whole grammar. Of course, menhir allows you to define parameterized rules that could help with that (example below untested):
ws(rule):
| list(WS) result = rule list(WS) { result }

How can I use a regular expression to match something in the form 'stuff=foo' 'stuff' = 'stuff' 'more stuff'

I need a regexp to match something like this,
'text' | 'text' | ... | 'text'(~text) = 'text' | 'text' | ... | 'text'
I just want to divide it up into two sections, the part on the left of the equals sign and the part on the right. Any of the 'text' entries can have "=" between the ' characters though. I was thinking of trying to match an even number of 's followed by a =, but I'm not sure how to match an even number of something. Also note I don't know how many entries there could be on either side. A couple of examples:
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = '51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
Should extract,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33)
and,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637) = '227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Should extract,
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637)
and,
'227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Also note the little tilde bit at the end of the first section is optional, so I can't just look for that.
Actually I wouldn't use a regex for that at all. Assuming your language has a split operation, I'd first split on the | character to get a list of:
'51NL9637X33'
'ISL6262ACRZ-T'
'QFN'(~51NL9637X33) = '51NL9637X33'
'ISL6262ACRZ-T'
'INTERSIL'
'QFN7SQ-HT1_P49'
'()'
Then I'd split each of them on the = character to get the key and (optional) value:
'51NL9637X33' <null>
'ISL6262ACRZ-T' <null>
'QFN'(~51NL9637X33) '51NL9637X33'
'ISL6262ACRZ-T' <null>
'INTERSIL' <null>
'QFN7SQ-HT1_P49' <null>
'()' <null>
You haven't specified why you think a regex is the right tool for the job but most modern languages also have a split capability and regexes aren't necessarily the answer to every requirement.
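Because the entries may contain = (and possibly |) between the quotes, a plain split can cut inside a quoted entry. A variant of the split idea that respects the quotes can be sketched in Python; split_outside_quotes is a hypothetical helper, and the sketch assumes single quotes are always balanced and never escaped. It splits on = first and on | second, which is the reverse of the order described above.

```python
def split_outside_quotes(text, sep, maxsplit=-1):
    """Split text on sep, ignoring any sep found between single quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == "'":
            in_quotes = not in_quotes
        can_split = maxsplit < 0 or len(parts) < maxsplit
        if ch == sep and not in_quotes and can_split:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts

entry = "'a=b' | 'c'(~x) = 'd' | 'e|f' :K ='v'"
left, right = split_outside_quotes(entry, "=", maxsplit=1)
print(left)
print(right)
print(split_outside_quotes(left, "|"))
```

Using maxsplit=1 matters for inputs like the second example in the question, where further unquoted = signs appear on the right-hand side.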
I agree with paxdiablo in that regular expressions might not be the most suitable tool for this task, depending on the language you are working with.
The question "How do I match an even number of characters?" is interesting nonetheless, and here is how I'd do it in your case:
(?:'[^']*'|[^=])*(?==)
This expression matches the left part of your entry by looking for a ' at its current position. If it finds one, it runs forward to the next ', thereby matching only an even number of quotes. If it does not find a ', it matches any character that is not an equals sign. Finally, the lookahead ensures that an equals sign follows the matched string. It works because the regex engine evaluates OR constructs from left to right.
You could get the left and right parts in two capturing groups by using
((?:'[^']*'|[^=])*)=(.*)
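As a quick check, the two-group expression can be tried in Python, whose re module also evaluates alternatives from left to right. The helper name split_entry and the sample input are assumptions made for this sketch.

```python
import re

SPLIT_ENTRY = re.compile(r"((?:'[^']*'|[^=])*)=(.*)")

def split_entry(entry):
    """Return (left, right) around the first = outside quotes, or None."""
    m = SPLIT_ENTRY.match(entry)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

print(split_entry("'a=b' | 'c'(~x) = 'd' | 'e'"))
print(split_entry("'a' | 'b'"))  # no equals sign at all
```

One caveat: when the only = in the input sits inside quotes, the engine backtracks, treats the opening ' as an ordinary character, and splits inside the quoted entry, so the expression should only be applied to lines known to contain an unquoted =.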
I recommend http://gskinner.com/RegExr/ for tinkering with regular expressions. =)
As paxdiablo said, you almost certainly don't want to use a regex here. The split suggestion isn't bad; I myself would probably use a parser here—there's a lot of structure to exploit. The idea here is that you formally specify the syntax of what you have—sort of like what you gave us, only rigorous. So, for instance: a field is a sequence of non-single-quote characters surrounded by single quotes; a fields is any number of fields separated by white space, a |, and more white space; a tilde is non-right-parenthesis characters surrounded by (~ and ); and an expr is a fields, optional whitespace, an optional tilde, a =, optional whitespace, and another fields. How you express this depends on the language you are using. In Haskell, for instance, using the Parsec library, you write each of those parsers as follows:
import Text.ParserCombinators.Parsec
field :: Parser String
field = between (char '\'') (char '\'') $ many (noneOf "'\n")
tilde :: Parser String
tilde = between (string "(~") (char ')') $ many (noneOf ")\n")
fields :: Parser [String]
fields = field `sepBy` (try $ spaces >> char '|' >> spaces)
expr :: Parser ([String],Maybe String,[String])
expr = do left <- fields
          spaces
          opt <- optionMaybe tilde
          spaces >> char '=' >> spaces
          right <- fields
          (char '\n' >> return ()) <|> eof
          return (left, opt, right)
Understanding precisely how this code works isn't really important; the basic idea is to break down what you're parsing, express it in formal rules, and build it back up out of the smaller components. And for something like this, it'll be much cleaner than a regex.
If you really want a regex, here you go (barely tested):
^\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?(\(~[^)\n]*\))?\s*=\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?\s*$
See why I recommend a parser? When I first wrote this, I got at least two things wrong which I picked up (one per test), and there's probably something else. And I didn't insert capturing groups where you wanted them because I wasn't sure where they'd go. Now yes, I could have made this more readable by inserting comments, etc. And after all, regexen have their uses! However, the point is: this is not one of them. Stick with something better.