OCamllex matching beginning of line? - ocaml

I am messing around writing a toy programming language in OCaml with ocamllex, and was trying to make the language sensitive to indentation changes, python-style, but am having a problem matching the beginning of a line with ocamllex's regex rules. I am used to using ^ to match the beginning of a line, but in OCaml that is the string concat operator. Google searches haven't been turning up much for me unfortunately :( Anyone know how this would work?

I'm not sure if there is explicit support for zero-length matching symbols (like ^ in Perl-style regular expressions, which matches a position rather than a substring). However, you should be able to let your lexer turn newlines into an explicit token, something like this:
parser.mly
%token EOL
%token <int> EOLWS
/* other stuff here */
%%
main:
    EOL stmt { MyStmtDataType(0, $2) }
  | EOLWS stmt { MyStmtDataType($1 - 1, $2) }
;
lexer.mll
{
  open Parser
  exception Eof
}
rule token = parse
    [' ' '\t'] { token lexbuf } (* skip other blanks *)
  | ['\n'][' ']+ as lxm { EOLWS(String.length(lxm)) }
  | ['\n'] { EOL }
  (* ... *)
This is untested, but the general idea is:
Treat newlines as statement 'starters'
Measure whitespace that immediately follows the newline and pass its length as an int
Caveat: you will need to preprocess your input to start with a single \n if it doesn't contain one.
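The same idea can be sketched outside ocamllex. Below is a minimal Python illustration, not the answer's lexer: the token names EOL and EOLWS mirror the ones above, while `tokenize`, `TOKEN_RE` and the catch-all WORD token are hypothetical names made up for the example. Unlike the ocamllex version, the newline is subtracted in the tokenizer rather than in the parser.

```python
import re

# Alternatives are tried left to right, so "newline + spaces" must come
# before the bare newline. WORD is a stand-in for "everything else".
TOKEN_RE = re.compile(
    r"(?P<EOLWS>\n +)|(?P<EOL>\n)|(?P<BLANK>[ \t]+)|(?P<WORD>\S+)"
)

def tokenize(src):
    """Turn each line start into an explicit EOL/EOLWS token."""
    # The caveat from the answer: make sure the input starts with a newline.
    if not src.startswith("\n"):
        src = "\n" + src
    tokens = []
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == "BLANK":
            continue  # skip blanks that do not follow a newline
        if kind == "EOLWS":
            # The lexeme is "\n" plus the spaces, so drop one for the newline.
            tokens.append(("EOLWS", len(m.group()) - 1))
        elif kind == "EOL":
            tokens.append(("EOL", 0))
        else:
            tokens.append(("WORD", m.group()))
    return tokens

print(tokenize("a\n  b\nc"))
```

Each statement is now preceded by a token carrying its indentation width, which a parser can compare against the previous line's width.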

Related

Regexp word between the braces [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
          ^                                                      ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
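A minimal Python sketch of that counter-based scan follows. The function name `outer_parens` and its parameters are made up for illustration; it returns the first balanced outer group, or None if there is none.

```python
def outer_parens(text, open_ch="(", close_ch=")"):
    """Return the first balanced outer group in text, or None."""
    depth = 0      # open parentheses not yet matched by a closing one
    start = None   # index of the outermost opening parenthesis
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # The counter is back at zero: this is the final closer.
                return text[start:i + 1]
    return None

s = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_parens(s))
```

Stray closing parentheses that appear before any opener are simply skipped in this sketch; a stricter version could treat them as an error.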
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, an FSA can remember only its current state; it has no information about the previous states.
In the above diagram, S1 and S2 are two states, where S1 is both the starting and the final state. So if we try the string 0110, the transitions go as follows:
      0     1     1     0
-> S1 -> S2 -> S2 -> S2 -> S1
In the above steps, when we are at the second S2, i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can be used to solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' one whenever we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
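The push/pop procedure described above can be sketched in a few lines of Python. This is only an illustration of the stack idea; the name `is_balanced` is made up for the example.

```python
def is_balanced(text):
    """Check parentheses with an explicit stack, as a PDA would."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)      # push every opening parenthesis
        elif ch == ")":
            if not stack:
                return False      # a closer with nothing left to match
            stack.pop()           # pop the matching opener
    return not stack              # balanced only if the stack is empty

print(is_balanced("(text here(possible text))"))
print(is_balanced("(()"))
```

With a single bracket type the stack could be replaced by a counter; a real stack becomes necessary once several bracket types must be matched against each other.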
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
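To see what "first opening to last closing" means in practice, here is a small Python check of the lookaround pattern above using the standard re module. The sample string and the name `between_first_and_last` are made up for the example.

```python
import re

def between_first_and_last(text):
    """Text between the first '(' and the last ')', ignoring nesting."""
    m = re.search(r"(?<=\().*(?=\))", text)
    return m.group() if m else None

# The greedy .* runs to the last ')' even though the parentheses
# before it are already balanced.
print(between_first_and_last("a(b(c)d)e)f"))
```

Note that the result contains an unbalanced closing parenthesis, which is exactly the limitation the answer describes.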
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
import java.util.ArrayList;
import java.util.List;

public static List<String> getBalancedSubstrings(String s, Character markStart,
        Character markEnd, Boolean includeMarkers)
{
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;                // current nesting depth
    int lastOpenDelimiter = -1;   // start index of the current outer group
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see @dehmann
If it's just first open to last close see @Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
             "[" : "]",
             "{" : "}",
             "<" : ">",
             '"' : '"',
             "'" : "'",
             "BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you the first occurrence
str.lastIndexOf(')'); - the last one
So you need the string between them (the + 1 keeps the closing parenthesis in the result):
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1);
Because JS regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with it.
So here is a simple JavaScript for-loop version that turns a string of "method(arg)" calls into an array. Example inputs:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []
  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue
    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        (method ? (method += char) : (method = char))
      } else {
        (arg ? (arg += char) : (arg = char))
      }
    }
    if (char === '(') {
      // nested parenthesis should be a part of arg
      if (!isMethod) arg += char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg += char
      }
    }
  }
  return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return a list of (start, end) index pairs, one per balanced {...} block in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # compare with None: index 0 is a valid start but is falsy
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

How to match a string with an opening brace { in C++ regex

I have a question about writing regexes in C++. I have 2 regexes which work fine in Java, but in C++ they throw an error, namely:
one of * + was not preceded by a valid regular expression C++
These regexes are as follows:
regex r1("^[\s]*{[\s]*\n"); //Space followed by '{' then followed by spaces and '\n'
regex r2("^[\s]*{[\s]*\/\/.*\n") // Space followed by '{' then by '//' and '\n'
Can someone help me how to fix this error or re-write these regex in C++?
See basic_regex reference:
By default, regex patterns follow the ECMAScript syntax.
ECMAScript syntax reference states:
characters:
\character
description: character
matches: the character character as it is, without interpreting its special meaning within a regex expression.
Any character can be escaped except those which form any of the special character sequences above.
Needed for: ^ $ \ . * + ? ( ) [ ] { } |
So, you need to escape { to get the code working:
std::string s("\r\n { \r\nSome text here");
regex r1(R"(^\s*\{\s*\n)");
regex r2(R"(^\s*\{\s*//.*\n)");
std::string newtext = std::regex_replace( s, r1, "" );
std::cout << newtext << std::endl;
See IDEONE demo
Also, note how the R"(pattern_here_with_single_escaping_backslashes)" raw string literal syntax simplifies a regex declaration.
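For comparison, here is how the same two patterns look with Python's standard re module. This is only a sketch of the raw-string idea in another language, not a statement about std::regex: Python's r"..." literals play the role of C++'s R"(...)", and the escaped \{ is spelled the same way. The second sample input is made up for the example.

```python
import re

s = "\r\n { \r\nSome text here"

# Raw strings keep the backslashes intact, so each escape is written once.
r1 = re.compile(r"^\s*\{\s*\n")        # blanks, '{', blanks, newline
r2 = re.compile(r"^\s*\{\s*//.*\n")    # blanks, '{', blanks, // comment, newline

newtext = r1.sub("", s)
print(newtext)

commented = r2.sub("", " { // opening brace\nrest")
print(commented)
```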

Flex RegEx to find string not starting with a pattern

I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've successfully managed to recognize these tokens by forcing the # at the beginning of the comment regular expression and " at the end of the string regular expression, but I don't want to do this because later on, using Bison, the tokens I get are exactly # this is a comment and "string value". Instead I want this is a comment (without #) and string value (without ")
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of white spaces, as well as tabs, inside the string, the comment and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;
To do that you have to treat " and # as separate tokens (so they get scanned individually, apart from the text you are scanning now) and use a %s or %x start condition to change which patterns the scanner accepts after reading those tokens.
This has a drawback: you will receive # as an individual token before the comment, and " before and after the string contents, and you'll have to cope with that in your grammar. This complicates both your grammar and the scanner, so I would discourage this approach.
There is a better solution: write a routine to unescape things, and keep the scanner simple by matching the whole input string in yytext and simply doing
m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the example at the end */
in the yylex(); function.
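The two helper routines are not shown in the answer. As a rough idea of what they might do, here is a Python sketch; the names `unescape_string` and `uncomment` mirror the hypothetical C helpers above, and the behaviour (strip the delimiters, trim the blanks after #) is an assumption based on what the question asks for.

```python
def unescape_string(lexeme):
    """Drop the surrounding double quotes from a STRING lexeme."""
    # Assumes the scanner only hands over text matching ["][^\n\r]*["].
    return lexeme[1:-1]

def uncomment(lexeme):
    """Drop the leading # (and the blanks around the text) from a COMMENT lexeme."""
    # Assumes the scanner only hands over text matching [#][^\n\r]*.
    return lexeme[1:].strip()

print(unescape_string('"string value"'))
print(uncomment("# this is a comment"))
```

A real unescape routine would typically also translate escape sequences inside the string, which this sketch does not attempt.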
Note
As comments are normally ignored, the best thing is to ignore them using a rule like:
"#".* ; /* ignored */
in your flex file. This makes the generated scanner discard the token just read instead of returning it.
Note 2
You probably haven't taken into account that your parser will accept lines of the form:
var = "data"
in front of any
[section]
line, so you'll run into trouble trying to addStringValue(...); when no section has been created. One possible solution is to modify your grammar to split the file into sections and force each of them to begin with a section line, like:
compilation: file comments ;
file: file section
    | ; /* empty */
section: section_header section_body ;
section_header: comments '[' ident ']' EOL ;
section_body: section_body comments assignment
    | ; /* empty */
comments: comments COMMENT
    | ; /* empty */
This is complicated by the fact that you want to process the comments. If you were to ignore them (using ; in the flex scanner), the grammar would be:
file: empty_lines file section
| ; /* empty */
empty_lines: empty_lines EOL
| ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
| ; /* empty */
assignment: IDENT '=' strings EOL
| EOL ; /* empty lines or lines with comments */
strings:
strings unit
| unit ;
unit: STRING
| IDENT
| NUMBER ;
This way the first thing allowed in your file, apart from comments (which are ignored) and blank space, is a section header. (EOLs are not considered blank space, as we cannot ignore them: they terminate lines.)

how to match only part of the expression to string in ocamllex

I have a simple ocamllex program where the rules section looks somewhat like this-
let digits= ['0'-'9']
let variables= 'X'|'Z'
rule addinlist = parse
|['\n'] {addinlist lexbuf;}
| "Inc" '(' variables+ '(' digits+ ')' ')' as ine { !inputstringarray.(!inputstringarrayi) <-ine;
inputstringarrayi := !inputstringarrayi +1;
addinlist lexbuf}
|_ as c
{ printf "Unrecognized character: %c\n" c;
addinlist lexbuf
}
| eof { () }
My question is: suppose I want to match Inc(X(7)) so that I can convert it to my abstract syntax, which is "Inc of var of int". I want my lexer to give me separate strings while reading Inc(X(7)): "Inc" as one string (say inb), followed by "X" as another (say inc), followed by "7" as another (say ind), so that I can work with inb, inc and ind instead of being stuck with the whole string ine that my program gives me now. How do I go about this? I hope my question is clear.
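One way to picture what is being asked for is a pattern with one capture per part. The Python sketch below uses named groups for that; the group names inb, inc and ind come from the question, while `INC_RE` and `split_inc` are made up for the example. In ocamllex itself, the `as` binding can, as far as I know, be attached to a parenthesised sub-pattern as well as to the whole pattern, e.g. `(variables+ as inc)`, which would give the same kind of split.

```python
import re

# One named group per part: the keyword, the variable, and the digits.
INC_RE = re.compile(r"(?P<inb>Inc)\((?P<inc>[XZ]+)\((?P<ind>[0-9]+)\)\)")

def split_inc(text):
    """Return (keyword, variable, number) for text like Inc(X(7)), else None."""
    m = INC_RE.fullmatch(text)
    if m is None:
        return None
    return m.group("inb"), m.group("inc"), int(m.group("ind"))

print(split_inc("Inc(X(7))"))
```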

(F)Lex : get text not matched by rules / get default output

I've read a lot about (F)Lex so far, but I couldn't find an answer.
Actually I have 2 questions, and getting the answer for one would be enough.
I have strings like:
TOTO 123 CD123 RGF 32/FDS HGGH
For each token I find, I put it in a vector. For example, for this string, I get a vector like this:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, FDS, whitespace, HGGH
The "/" does not match any rules, but still, i would like to put it in my vector when I reach it and get:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, /, FDS, whitespace, HGGH
So my questions are:
1) Is there a possibility to modify the default action when an input does not match any rule? (instead of print on stdout ?)
2) If it is not possible, how to catch this ? because here, "/" is an example but it can be everything ( % , C, 3, Blabblabla, etc that does not match my rules), and I can't put
.* { else(); }
cause Flex uses the regex which matches the longest string. I would like that my rules to be "sorted", and ".*" would be the last, like changing the "preferences" of Flex.
Any idea ?
The usual way is to have a rule something like
. { do_something_with_extra_char(*yytext); }
at the END of your rules. This will match any single character (other than newline -- you need a rule that matches newline somewhere too) that doesn't match any other rule. If you have multiple unmatched characters, this rule will trigger multiple times, but generally that is fine.
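The effect of a catch-all last rule can be tried out quickly in Python. This is a sketch under assumptions, not a flex equivalent: the token classes (WORD, NUMBER, SPACE) are guesses at the asker's rules, and Python's re picks the first alternative that matches rather than the longest one, so the ordering here only imitates putting `.` at the end of the rule list.

```python
import re

# The last alternative plays the role of the "." rule: it only gets a
# chance when none of the earlier alternatives match at this position.
TOKEN_RE = re.compile(
    r"(?P<WORD>[A-Za-z]+)|(?P<NUMBER>[0-9]+)|(?P<SPACE>[ \t]+)"
    r"|(?P<NEWLINE>\n)|(?P<OTHER>.)"
)

def scan(text):
    """Return (kind, lexeme) pairs, one per token."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(scan("RGF 32/FDS"))
```

Each unmatched character becomes its own OTHER token, just as the single-character rule fires once per character.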
EDIT: I think Chris Dodd's answer is better. Here are two alternative solutions.
One solution would be to use states. When you read a single unrecognized character, enter into a different state, and build up the unrecognized token.
%{
char str[1024];
int strUsed;
%}
%x UNRECOGNIZED
%%
{SOME_RULE} {/* do processing */ }
. {BEGIN(UNRECOGNIZED); str[0] = yytext[0]; strUsed = 1; }
<UNRECOGNIZED>{bad_input} { strcpy(str+strUsed, yytext); strUsed+=yyleng; }
<UNRECOGNIZED>{good_input} { str[strUsed] = 0; vector_add(str); BEGIN(INITIAL); }
This solution works well if it's easy to write a regular expression to match "bad" input. Another solution is to slowly build up bad characters until the next valid match:
%{
char str[1024];
int strUsed = 0;
void goodMatch() {
if(strUsed) {
str[strUsed] = 0;
vector_add(str);
strUsed = 0;
}
}
%}
%%
{SOME_RULE} { goodMatch(); /* do processing */ }
. {str[strUsed++] = yytext[0]; }
Note that this requires you to modify all existing rules to add in a call to function goodMatch.
Note for both solutions: if you use a statically sized buffer, you'll have to ensure you don't overflow it on the strcpy. If you end up using a dynamically sized string, you'll have to be sure to correctly clean up memory.
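The second approach (collect bad characters until the next valid match) can be sketched in Python as follows. The names `GOOD_RE` and `scan_grouping_unknown` and the set of "good" patterns are assumptions for the example. Using a list and strings avoids the fixed-size buffer issue mentioned above, and the sketch also flushes pending characters at end of input, a step the flex version would have to handle separately.

```python
import re

# Patterns considered "recognized"; everything else is collected as unknown.
GOOD_RE = re.compile(r"[A-Za-z]+|[0-9]+|[ \t]+")

def scan_grouping_unknown(text):
    """Tokenize text, merging runs of unrecognized characters into one token."""
    tokens = []
    pending = []   # unrecognized characters seen since the last good match
    pos = 0
    while pos < len(text):
        m = GOOD_RE.match(text, pos)
        if m:
            if pending:                      # same job as goodMatch()
                tokens.append("".join(pending))
                pending.clear()
            tokens.append(m.group())
            pos = m.end()
        else:
            pending.append(text[pos])
            pos += 1
    if pending:                              # flush what is left at end of input
        tokens.append("".join(pending))
    return tokens

print(scan_grouping_unknown("32/%FDS"))
```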