flex and bison: parse string without quotes - c++

I'm working on an MGF file parser (syntax: http://www.matrixscience.com/help/data_file_help.html) using flex + bison in C++.
I've written the lexer (lex) and the parser (yacc), but I have a problem I can't solve when I try to parse strings.
Important : there is no ' or " around the string.
Here is an example of input:
CHARGE=1+, 2+ and 3+
#some comments
BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min //line 20
PEPMASS=417.21083 35173
CHARGE=3+
123.79550 20
285.16455 56
302.14335 146 1+
[other datas ...]
END IONS
BEGIN IONS
[an other one ... ]
Here the (minimal) lexer:
MGF_TOKEN_DEBUG is just a macro to print a line:
#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl
\n {
MGF_TOKEN_DEBUG("T_EOL");
return token::T_EOL;
}
^[#;!/][^\n]* {
MGF_TOKEN_DEBUG("T_COMMENT");
return token::T_COMMENT;
}
[[:space:]] {}
/** values **/
[0-9]+ {
MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
return token::V_INTEGER;
}
[0-9]+"."[0-9]* {
MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
return token::V_DOUBLE;
}
[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
return token::V_DOUBLE;
}
"+" {
MGF_TOKEN_DEBUG("T_PLUS");
return token::T_PLUS;
}
"=" {
MGF_TOKEN_DEBUG("T_EQUALS");
return token::T_EQUALS;
}
"," {
MGF_TOKEN_DEBUG("T_COMA");
return token::T_COMA;
}
"and" {
MGF_TOKEN_DEBUG("T_AND");
return token::T_AND;
}
/*** keywords */
^"CHARGE" {
MGF_TOKEN_DEBUG("K_CHARGE");
return token::K_CHARGE;
}
^"TITLE" {
MGF_TOKEN_DEBUG("K_TITLE");
return token::K_TITLE;
}
[ others keywords ...]
/**** string : problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
return token::V_STRING;
}
And the (minimized) parser.
start : headerparams blocks T_END;
headerparams : /* empty */| headerparams headerparam;
headerparam : K_CHARGE T_EQUALS charge_list T_EOL | [others ...];
blocks : /* empty */ | blocks block;
block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL| T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL;
blockparam : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS V_STRING T_EOL | [others...];
ion : number number T_EOL| number number charge T_EOL;
ions : ions ion| ion;
number : V_INTEGER | V_DOUBLE;
charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS;
charge_list : charge| charge_list T_COMA charge | charge_list T_AND charge;
My problem is that I get the following tokens:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line
I would like to have:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (Cmpd 1, +MSn (417.2108), 10.0 min)
token: T_EOL
If someone can help me ...
Edit #1
I've "solved" the problem by concatenating tokens:
lex:
[A-Za-z][^\n[:space:]+-=,]* {
MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
return token::V_STRING;
}
yacc:
string_st : V_STRING
| string_st V_STRING
| string_st number
| string_st T_COMA
| string_st T_PLUS
| string_st T_MINUS
;
blockparam : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];

If your string always starts after the text TITLE and ends at a newline character (\n),
I would suggest using start conditions:
%x IN_TITLE
"TITLE" { /* return V_STRING of TITLE in c++ code */ BEGIN(IN_TITLE); }
<IN_TITLE>= { /* return T_EQUALS in c++ code */; }
<IN_TITLE>"\n" { BEGIN(INITIAL); }
<IN_TITLE>.* { MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");return token::V_STRING; }
%x IN_TITLE defines the IN_TITLE state, and matching the text TITLE enters it. Once in that state, \n switches back to the initial state (INITIAL is predefined), and all other characters are simply consumed into a V_STRING token without any further action.

Your basic problem is a simple typo:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])*
should be:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space:]])*
^
You don't actually need the | operator. The following is perfectly legal (but probably not what you want either; see below):
[A-Za-z][[:space:]:;,()A-Za-z0-9_.-]*
Once you fix that, you'll find that you have another problem: your keywords (TITLE, for example) will be lexed as STRING because the STRING pattern is longer. (In fact, since [:space:] includes \n, the STRING pattern will probably extend to the end of the input. You probably wanted [:blank:].)
I took a quick glance at the description of the format you're trying to parse, but it's not a very precise description. But it appears that parameter lines have the format:
^[[:alpha:]]+=.*$
Perhaps the :alpha: should be :alnum: or even something more permissive; as I said, the description wasn't very precise. What was clear is that:
The keyword is case-insensitive, so both TITLE and title will work identically, and
The = sign is obligatory and may not have a space on either side of it. (So your TITLE= line is not correct, but maybe it doesn't matter).
In order to not interfere with parsing of the data, you might want to make the above a single "token" whose value is the part after the = and whose type corresponds to the (case-normalized) keyword. Of course, each parameter-type may require an idiosyncratic value parser, which could only be achieved in flex by use of start conditions. In any event, you should think about the consequences of stray characters in the TITLE which are not part of the STRING pattern, and how you propose to deal with the resulting lexical error.
Your code does not make it clear how you communicate text values from your lexer to your parser. You need to be aware that the value of yytext is only safe inside of the lexer action for the token it corresponds to. The next call to the lexer will invalidate it, and bison parsers almost always have a lookahead token, so the lexer will have been called again before the token is processed. Consequently, you must copy yytext in order to pass it to the parser, and the parser needs to take ownership of the copy so that you don't end up leaking memory.
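As a rough illustration of the "one token per parameter line" idea above, here is a minimal Python sketch (not flex code). The function name lex_param_line and the exact line shape (letters, then =, then free text) are assumptions made for this example; the real format may allow more characters in the keyword.

```python
import re

# Assumed shape of a parameter line: letters, '=', then free text to end of line.
PARAM_LINE = re.compile(r"^([A-Za-z]+)=(.*)$")

def lex_param_line(line):
    """Return (KEYWORD, value) for a parameter line, or None if it is not one."""
    m = PARAM_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    keyword, value = m.groups()
    # Keywords are case-insensitive, so normalize them; the value is returned
    # as its own string, independent of the buffer it was scanned from.
    return keyword.upper(), value.strip()

print(lex_param_line("TITLE= Cmpd 1, +MSn(417.2108), 10.0 min"))
print(lex_param_line("charge=3+"))
print(lex_param_line("123.79550 20"))  # a data line, not a parameter line
```

In a flex scanner the same effect would come from a single rule for the whole line whose action copies yytext before returning, so the parser never sees the title text broken into smaller tokens.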

Related

How to code nextToken() function for a descent recursive parser LL(1)

I'm writing an LL(1) recursive descent parser in C++, but I have a problem because I don't know exactly how to get the next token. I know I have to use regular expressions to recognize a terminal, but I don't know how to get the longest next token.
For example, this lexical and this grammar (without left recursion, left factoring and without cycles):
//LEXICAL IN FLEX
TIME [0-9]+
DIRECTION UR|DR|DL|UL|U|D|L|R
ACTION A|J|M
%%
{TIME} {printf("TIME"); return (TIME);}
{DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
{ACTION} {printf("ACTION"); return (ACTION);}
"~" {printf("RELEASED"); return (RELEASED);}
"+" {printf("PLUS_OP"); return (PLUS_OP);}
"*" {printf("COMB_OP"); return (COMB_OP);}
//GRAMMAR IN BISON
command : list_move PLUS_OP list_action
| list_move COMB_OP list_action
| list_move list_action
| list_move
| list_action
;
list_move: move list_move_prm
;
list_move_prm: move
| move list_move_prm
| ";"
;
list_action: ACTION list_action_prm
;
list_action_prm: PLUS_OP ACTION list_action_prm
| COMB_OP ACTION list_action_prm
| ACTION list_action_prm
| ";" //epsilon
;
move: TIME RELEASED DIRECTION
| RELEASED DIRECTION
| DIRECTION
;
I have a string that contains "D DR R + A". It should be accepted, but I have a problem with "DR": since "D" is a token too, I don't know how to get "DR" instead of "D".
There are a number of ways of hand-writing a tokenizer:
You can use a recursive descent LL(1) parser "all the way down": rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
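The regex-based approach described above can be sketched in Python roughly as follows. The token names and patterns come from the question's flex rules; the tokenize function, the tie-breaking rule, and the choice to skip whitespace between tokens are assumptions made for this illustration.

```python
import re

# Token names and patterns taken from the question's flex rules.
# Python alternation is ordered, so the two-letter directions must stay first.
TOKEN_SPECS = [
    ("TIME", re.compile(r"[0-9]+")),
    ("DIRECTION", re.compile(r"UR|DR|DL|UL|U|D|L|R")),
    ("ACTION", re.compile(r"A|J|M")),
    ("RELEASED", re.compile(r"~")),
    ("PLUS_OP", re.compile(r"\+")),
    ("COMB_OP", re.compile(r"\*")),
]

def tokenize(text):
    """Yield (name, lexeme) pairs, always choosing the longest match."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():      # whitespace is allowed between tokens only
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earliest rule when two matches tie in length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        yield best_name, text[pos:pos + best_len]
        pos += best_len

print(list(tokenize("D DR R + A")))
```

Trying every pattern at each position is slower than a single combined DFA, but it is easy to write by hand and gives the same longest-match behaviour for a small token set like this one.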
“I don't know exactly how to get the next token”
Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard white space, then read a character (or more if necessary), analyze it, and return the associated token. The following functions will help you achieve your goal:
ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).
"I don't know how to get "DR" instead "D""
Both "D" and "DR" are words. Just read them as you would read a word: is >> word. You will also need a keyword to token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If not found, throw an exception.
A starting point (run it):
#include <iostream>
#include <iomanip>
#include <map>
#include <string>
enum token_t
{
END,
PLUS,
NUMBER,
D,
DR,
R,
A,
// ...
};
// ...
using keyword_to_token_t = std::map < std::string, token_t >;
keyword_to_token_t kwtt =
{
{"A", A},
{"D", D},
{"R", R},
{"DR", DR}
// ...
};
// ...
std::string s;
int n;
// ...
token_t get_token( std::istream& is )
{
char c;
std::ws( is ); // discard white-space
if ( !is.get( c ) ) // read a character
return END; // failed to read or eof
// analyze the character
switch ( c )
{
case '+': // simple token
return PLUS;
case '0': case '1': // rest of digits
is.putback( c ); // it starts with a digit: it must be a number, so put it back
is >> n; // and let the library do the hard work
return NUMBER;
//...
default: // keyword
is.putback( c );
is >> s;
if ( kwtt.find( s ) == kwtt.end() )
throw "keyword not found";
return kwtt[ s ];
}
}
int main()
{
try
{
while ( get_token( std::cin ) )
;
std::cout << "valid tokens";
}
catch ( const char* e )
{
std::cout << e;
}
}

Regular Expression: match and count "A" and stop after found "B" by counted times [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
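For engines without recursion, such as Python's built-in re module, the depth-limited pattern quoted above can be tried as in this small sketch. The constant name BALANCED_3 and the second test string are made up for the example.

```python
import re

# The depth-limited pattern quoted above: the outer pair plus three nested levels.
BALANCED_3 = re.compile(
    r"\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)"
)

text = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(BALANCED_3.search(text).group(0))

# Five levels are one too many: the outermost pair cannot match, so the search
# falls through to the first span that fits within the supported depth.
too_deep = "(a(b(c(d(e)))))"
print(BALANCED_3.search(too_deep).group(0))
```

The second call shows the limitation: once the input nests deeper than the pattern allows, the match silently moves to an inner span rather than failing.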
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
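The counter-based scan described above might look like this in Python. It is a minimal sketch; the function name outer_parenthesized is chosen for the example and does not come from the linked answer.

```python
def outer_parenthesized(text):
    """Return the first balanced '(...)' span in text, or None if there is none."""
    start = text.find("(")
    if start == -1:
        return None
    depth = 0                       # open parentheses not yet closed
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:          # the parenthesis opened at 'start' is now closed
                return text[start:i + 1]
    return None                     # ran out of input with parentheses still open

example = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_parenthesized(example))
```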
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under the Pushdown Automaton (PDA) model, which is one level above the FSA. A PDA has an additional stack to store extra information. PDAs can be used to solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' one off whenever we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
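The push/pop idea can be sketched in a few lines of Python. Handling square and curly brackets as well is an extension added for this example; the function name is_balanced is likewise made up.

```python
# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)                    # push the opener
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                            # leftovers mean unclosed openers

print(is_balanced("(text here(possible text))"))
print(is_balanced("(a[b)c]"))
```

With a single bracket type the stack degenerates into a counter, which is why a plain integer is enough for the parentheses-only case.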
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
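To see the "first opening to last closing" behaviour concretely, here is a small Python check of the lookaround pattern above. The sample string is made up for the illustration and deliberately contains one unmatched closing parenthesis.

```python
import re

sample = "abc ( 123 ( foobar ) def ) xyz ) ghij"

# Greedy .* runs from just after the first '(' to just before the last ')',
# regardless of how the parentheses in between pair up.
m = re.search(r"(?<=\().*(?=\))", sample)
print(repr(m.group(0)))
```

The match swallows the stray closing parenthesis after "def ", which shows that the pattern does no balancing at all.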
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions (see @dehmann).
If it's just first open to last close, see @Zach.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1); // + 1 keeps the closing parenthesis
Because JS regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with it.
So here is a simple JavaScript for-loop version that turns "method(arg)" strings into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
A language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return a list of (start, end) index pairs for balanced {...} blocks in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            if start_pos is None:  # 'not start_pos' would misfire at index 0
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it might be useful to people coming here in search of a regexp for nested structures:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Flex RegEx to find string not starting with a pattern

I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've managed to recognize these tokens by including the # at the beginning of the comment regular expression and the " at the end of the string regular expression, but I don't want to do this because later on, in Bison, the tokens I get are exactly # this is a comment and "string value". Instead I want this is a comment (without the #) and string value (without the ").
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of white spaces, as well as tabs, inside the string, the comment and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;
To do that you have to treat " and # as different tokens (so they get scanned as individual tokens, different from the one you are scanning now) and use a %s or %x start condition to change the accepted regular patterns on reading those tokens with the scanner input.
This has a drawback: you will receive # as an individual token before the comment, and " before and after the string contents, and you'll have to cope with that in your grammar. This complicates both the grammar and the scanner, so I would discourage you from following this approach.
There is a better solution: write a routine to unescape things and keep the scanner simple, returning the whole input string in yytext and simply doing
m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the example at the end */
in the yylex(); function.
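The two helper routines are only named above, not shown. As an illustration of what they could do, here is a Python sketch; the bodies are assumptions (in the actual scanner they would be C or C++ functions operating on yytext), and escape sequences inside strings are not handled.

```python
def unescape_string(lexeme):
    """Drop the surrounding double quotes from a STRING lexeme (hypothetical body)."""
    if len(lexeme) >= 2 and lexeme[0] == '"' and lexeme[-1] == '"':
        return lexeme[1:-1]
    return lexeme

def uncomment(lexeme):
    """Drop the leading '#' and surrounding blanks from a COMMENT lexeme (hypothetical body)."""
    return lexeme[1:].strip() if lexeme.startswith("#") else lexeme

print(unescape_string('"string value"'))
print(uncomment("# this is a comment"))
```

Because each helper returns a new string, the parser receives a value that stays valid after the scanner moves on to the next token.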
Note
As comments are normally ignored, the best thing is to ignore using a rule like:
"#".* ; /* ignored */
in your flex file. This makes the generated scanner ignore the token just read instead of returning it.
Note 2
You probably haven't taken into account that your parser will allow lines of the form:
var = "data"
in front of any
[section]
line, so you'll run into trouble calling addStringValue(...) when no section has been created yet. One possible solution is to modify your grammar to split the file into sections and force each of them to begin with a section line, like:
compilation: file comments ;
file: file section
| ; /* empty */
section: section_header section_body;
section_header: comments `[` ident `]` EOL
section_body: section_body comments assignment
| ; /* empty */
comments: comments COMMENT
| ; /* empty */
This is complicated by the fact that you want to process the comments. If you were to ignore them (using ; in the flex scanner), the grammar would be:
file: empty_lines file section
| ; /* empty */
empty_lines: empty_lines EOL
| ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
| ; /* empty */
assignment: IDENT '=' strings EOL
| EOL ; /* empty lines or lines with comments */
strings:
strings unit
| unit ;
unit: STRING
| IDENT
| NUMBER ;
This way the first thing allowed in your file, apart from comments (which are ignored) and blank space, is a section header. (EOLs are not considered blank space: we cannot ignore them, because they terminate lines.)

How do I use a regex to find a duplicated string on the left and the right of an equals sign

I have a file with some translations and some missing translations, where the English key equals the translation.
...
/* comment1 */
"An unexpected error occurred." = "Ein unerwarteter Fehler ist aufgetreten.";
/* comment2 */
"Enter it here..." = "Enter it here...";
...
Is it possible to:
Find all occurrences of "X" = "X";?
Bonus: For all occurrences delete the line, the comment line above and newline above that?
You'll need to use backreferences here, something along the lines of:
/"(.+)"\s*=\s*"\1"/
^ ^
| |
| backreference to first string
|
capture group for first string
Note that the syntax for backreferences varies between languages, the above one works for your case in Ruby, e.g.
❯ irb
2.2.2 :001 > r = /"(.+)"\s*=\s*"\1"/
=> /"(.+)"\s*=\s*"\1"/
2.2.2 :002 > r.match('"foo" = "foo"')
=> #<MatchData "\"foo\" = \"foo\"" 1:"foo">
2.2.2 :003 > r.match('"foo" = "bar"')
=> nil
In response to your comment about wanting to do it in a text editor, remove the leading/trailing slashes and the above regex should work fine in Sublime Text... YMMV in other editors.
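The same backreference pattern also works in Python's re module. The sketch below is an illustration using the two sample lines from the question; the variable names are made up.

```python
import re

# \1 must repeat exactly what the first group captured on the left-hand side.
SAME_BOTH_SIDES = re.compile(r'"(.+)"\s*=\s*"\1"')

lines = [
    '"An unexpected error occurred." = "Ein unerwarteter Fehler ist aufgetreten.";',
    '"Enter it here..." = "Enter it here...";',
]

untranslated = [line for line in lines if SAME_BOTH_SIDES.search(line)]
print(untranslated)
```

Only the second line is reported, since its key and its value are identical.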
For the Bonus question:
(\R\R)?+/\*[^*]*(?:\*+(?!/)[^*]*)*\*/\R("[^"]*") = \2;(?(1)|\R{0,2})
demo
(works with notepad++, remove the newline above, except for the first item.)
You can find all the occurrences by matching each line with the following pattern: "(.*?)"\s*=\s*"\1". If you get a match, you can delete the line.
Java working example
public class StackOverflow32507709 {
public static String pattern;
static {
pattern = "\"(.*?)\"\\s*=\\s*\"\\1\"";
}
public static void main(String[] args) {
String[] text = {
"/* comment1 */",
"\r\n",
"\"An unexpected error occurred\" = \"German translation...\";\r\n",
"\r\n",
"\"Enter it here\" = \"Enter it here\";\r\n"
};
List<String> filteredTranslations = new ArrayList<String>();
Pattern p = Pattern.compile(pattern);
for (String line : text) {
Matcher m = p.matcher(line);
if (!m.find()) {
filteredTranslations.add(line);
}
m.reset();
}
for (String filteredTranslation : filteredTranslations) {
System.out.println(filteredTranslation);
}
}
}
You need to use a backreference, like this: http://www.regular-expressions.info/backref.html
I can't give you a full answer because you haven't said which programming language you are using, but I'm sure you can figure it out from there.

OCamllex matching beginning of line?

I am messing around writing a toy programming language in OCaml with ocamllex, and was trying to make the language sensitive to indentation changes, python-style, but am having a problem matching the beginning of a line with ocamllex's regex rules. I am used to using ^ to match the beginning of a line, but in OCaml that is the string concat operator. Google searches haven't been turning up much for me unfortunately :( Anyone know how this would work?
I'm not sure if there is explicit support for zero-length matching symbols (like ^ in Perl-style regular expressions, which matches a position rather than a substring). However, you should be able to let your lexer turn newlines into an explicit token, something like this:
parser.mly
%token EOL
%token <int> EOLWS
% other stuff here
%%
main:
EOL stmt { MyStmtDataType(0, $2) }
| EOLWS stmt { MyStmtDataType($1 - 1, $2) }
;
lexer.mll
{
open Parser
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip other blanks *)
| ['\n'][' ']+ as lxm { EOLWS(String.length(lxm)) }
| ['\n'] { EOL }
(* ... *)
This is untested, but the general idea is:
Treat newlines as statement 'starters'
Measure whitespace that immediately follows the newline and pass its length as an int
Caveat: you will need to preprocess your input to start with a single \n if it doesn't contain one.
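To make the idea easier to experiment with outside ocamllex, here is a rough Python sketch of the same scheme. The token names mirror EOL and EOLWS from the lexer above; the function name line_start_tokens is made up, and unlike the OCaml version it reports the indentation width directly rather than the lexeme length including the newline.

```python
import re

# A newline, optionally followed by spaces, becomes a single token.
LINE_START = re.compile(r"\n( *)")

def line_start_tokens(source):
    """Return the EOL / EOLWS tokens for each line start in source."""
    if not source.startswith("\n"):
        source = "\n" + source      # the preprocessing step from the caveat
    tokens = []
    for m in LINE_START.finditer(source):
        indent = len(m.group(1))
        tokens.append(("EOLWS", indent) if indent else ("EOL",))
    return tokens

print(line_start_tokens("a\n  b\nc"))
```

A full lexer would interleave these with the ordinary tokens of each line; comparing successive indentation widths is then enough to detect indent and dedent, Python-style.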