Regular experession to match math block - regex

Hi I need to match math block in string. The math block starts with $$ and ends with $$. There can be any number of math blocks.
for example input can look like:
abcd... asdfasdf
$$
math expression
$$
<another set of random words>
$$
expression
$$
...
What is the right regex to match only the math expressions?
Thanks.

you could try /\$\$((?:\$[^\$]|[^\$])+)\$\$/g and that will match anything between $$ and $$ including single $.
Example:
http://regexr.com/3gu75
let text = document.body.innerHTML;
let regex = /\$\$((?:\$[^\$]|[^\$])+)\$\$/g,
match;
while( (match = regex.exec(text)) != null) {
console.log(match[1].trim());
}
abcd... asdfasdf
$$
math expression
$$
another set of random words
$$
expression
$$
$$
expression with a $ symbol
$$

I don't know what is typescript, but i looked it up it's like javascript. Why don't you use new RegExp()?
like new RegExp(/^\$\$.+\$\$$/g).exec(your str)
Char .+ will detect any character and so on until it find $ at the end. You can change this into your specific condition to match

Assuming that every odd-numbered $$ starts a math mode block, and every even-numbered one ends the previous block, you could just split the string on $$ and take the odd elements of the array:
> str=`abcd... asdfasdf
$$
math expression
$$
<another set of random words>
$$
expression
$$`
> str.split("$$").filter((s, index) => index % 2 === 1)
Array [ " math expression ", " expression " ]
The array still contains the leading and trailing spaces - you can use String.trim to get rid of them.

Related

Changing how Antlr4 prints a tree

I am using Antlr4 in IntelliJ to make a small compiler for arithmetic expressions.
I want to print the tree and use this code snippet to do so.
JFrame frame = new JFrame("Tree");
JPanel panel = new JPanel();
TreeViewer viewr = new TreeViewer(Arrays.asList(
parser.getRuleNames()),tree);
viewr.setScale(2);//scale a little
panel.add(viewr);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(400,400);
frame.setVisible(true);
This makes a tree which looks like this, for the input 3*5\n
Is there a way to adjust this so it reads from top to bottom
Statement
Expression /n
INT * INT
3 5
instead?
My grammar is defined as:
grammar Expression;
statement: expression ENDSTATEMENT # printExpr
| ID '=' expression ENDSTATEMENT # assign
| ENDSTATEMENT # blank
;
expression: expression MULTDIV expression # MulDiv
| expression ADDSUB expression # AddSub
| INT # int
| FLOAT # float
| ID # id
| '(' expression ')' # parens
;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
MULTDIV : ('*' | '/'); //match multiply or divide
ADDSUB : ('+' | '-'); //match add or subtract
FLOAT: INT '.' INT; //match a floating point number
ENDSTATEMENT:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WHITESPACE : [ \t]+ -> skip ; // ignore whitespace
No, not without changing the source of the tree viewer yourself.

Yacc grammar producing incorrect terminal

I've been working on a hobby compiler for a while now, using lex and yacc for the parsing stage. This is all working fine for the majority of things, but when I added in if statements, the production rule for symbols is now giving the previous (or next?) item on the stack instead of the symbol value needed.
Grammar is given below with hopefully unrelated rules taken out:
%{
...
%}
%define parse.error verbose
%token ...
%%
Program:
Function { root->addChild($1);}
;
Function:
Type Identifier '|' ArgumentList '|' StatementList END
{ $$ = new FunctionDef($1, $2, $4, $6); }
/******************************************/
/* Statements and control flow ************/
/******************************************/
Statement:
Expression Delimiter
| VariableDeclaration Delimiter
| ControlFlowStatement Delimiter
| Delimiter
;
ControlFlowStatement:
IfStatement
;
IfStatement:
IF Expression StatementList END { $$ = new IfStatement($2, $3); }
| IF Expression StatementList ELSE StatementList END { $$ = new IfStatement($2, $3, $5);}
;
VariableDeclaration:
Type Identifier { $$ = new VariableDeclaration($1, $2);}
| Type Identifier EQUALS Expression { $$ = new VariableDeclaration($1, $2, $4);}
;
StatementList:
StatementList Statement { $1->addChild($2); }
| Statement { $$ = new GenericList($1); }
;
Delimiter:
';'
| NEWLINE
;
Type:
...
Expression:
...
PostfixExpression:
Value '[' Expression ']' { std::cout << "TODO: indexing operators ([ ])" << std::endl;}
| Value '.' SYMBOL { std::cout << "TODO: member access" << std::endl;}
| Value INCREMENT { $$ = new UnaryExpression(UNARY_POSTINC, $1); }
| Value DECREMENT { $$ = new UnaryExpression(UNARY_POSTDEC, $1); }
| Value '(' ')' { $$ = new FunctionCall($1, NULL); }
| Value '(' ExpressionList ')' { $$ = new FunctionCall($1, $3); }
| Value
;
Value:
BININT { $$ = new Integer(yytext, 2); }
| HEXINT { $$ = new Integer(yytext, 16); }
| DECINT { $$ = new Integer(yytext); }
| FLOAT { $$ = new Float(yytext); }
| SYMBOL { $$ = new Symbol(yytext); }
| STRING { $$ = new String(yytext); }
| LambdaFunction
| '(' Expression ')' { $$ = $2; }
| '[' ExpressionList ']' { $$ = $2;}
;
LambdaFunction:
...
%%
I cannot work out what about the control flow code can make the Symbol:
rule match something that isn't classed as a symbol from the lex definition:
symbol [a-zA-Z_]+(alpha|digit)*
...
{symbol} {return SYMBOL;}
Any help from somebody who knows about yacc and grammars in general would be very much appreciated. Also example files of the syntax it parses can be shown if necessary.
Thanks!
You cannot count on the value of yytext outside of a flex action.
Bison grammars typically read a lookahead token before deciding on how to proceed, so in a bison action, yytext has already been replaced with the token value of the lookahead token. (You can't count on that either, though: sometimes no lookahead token is needed.)
So you need to make a copy of yytext before the flex action returns and make that copy available to the bison grammar by placing it into the yylval semantic union.
See this bison FAQ entry
By the way, the following snippet from your flex file is incorrect:
symbol [a-zA-Z_]+(alpha|digit)*
In that regular expression, alpha and digit are just ordinary strings, so it is the same as [a-zA-Z_]+("alpha"|"digit")*, which means that it will match, for example, a_digitdigitdigit but not a_123. (It would have matched a_digitdigitdigit without the part following the +, so I presume that wasn't your intention.)
On the whole, I think it's better to use Posix character classes than either hand-written character classes or defined symbols, so I would write that as
symbol [[:alpha:]_]([[:alnum:]_]*[[:alnum:]])?
assuming that your intention is that a symbol can start but not end with an underscore, and end but not start with a digit. Using Posix character classes requires you to execute flex with the correct locale -- almost certainly the C locale -- but so do character ranges, so there is nothing to be lost by using the self-documenting Posix classes.
(Of course, I have no idea what your definitions of {alpha} and {digit} are, but it seems to me that they are either the same as [[:alpha:]] and [[:digit:]], in which case they are redundant, or different from the Posix classes, in which case they are confusing to the reader.)

What is $$ in bison?

In the bison manual in section 2.1.2 Grammar Rules for rpcalc, it is written that:
In each action, the pseudo-variable $$ stands for the semantic value
for the grouping that the rule is going to construct. Assigning a
value to $$ is the main job of most actions
Does that mean $$ is used for holding the result from a rule? like:
exp exp '+' { $$ = $1 + $2; }
And what's the typical usage of $$ after begin assigned to?
Yes, $$ is used to hold the result of the rule. After being assigned to, it typically becomes a $x in some higher-level (or lower precedence) rule.
Consider (for example) input like 2 * 3 + 4. Assuming you follow the normal precedence rules, you'd have an action something like: { $$ = $1 * $3; }. In this case, that would be used for the 2 * 3 part and, obviously enough, assign 6 to $$. Then you'd have your { $$ = $1 + $3; } to handle the addition. For this action, $1 would be given the value 6 that you assigned to $$ in the multiplication rule.
Does that mean $$ is used for holding the result from a rule? like:
Yes.
And what's the typical usage of $$ after begin assigned to?
Typically you won’t need that value again. Bison uses it internally to propagate the value. In your example, $1 and $2 are the respective semantic values of the two exp productions, that is, their values were set somewhere in the semantic rule for exp by setting its $$ variable.
Try this. Create a YACC file with:
%token NUMBER
%%
exp: exp '+' NUMBER { $$ = $1 + $3; }
| exp '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
Then process it using Bison or YACC. I am using Bison but I assume YACC is the same. Then just find the "#line" directives. Let us find the "#line 3" directive; it and the relevant code will look like:
#line 3 "DollarDollar.y"
{ (yyval) = (yyvsp[(1) - (3)]) + (yyvsp[(3) - (3)]); }
break;
And then we can quickly see that "$$" expands to "yyval". That other stuff, such as "yyvsp", is not so obvious but at least "yyval" is.
$$ represents the result reference of the current expression's evaluation. In other word, its result.Therefore, there's no particular usage after its assignation.
Bye !

OCamllex matching beginning of line?

I am messing around writing a toy programming language in OCaml with ocamllex, and was trying to make the language sensitive to indentation changes, python-style, but am having a problem matching the beginning of a line with ocamllex's regex rules. I am used to using ^ to match the beginning of a line, but in OCaml that is the string concat operator. Google searches haven't been turning up much for me unfortunately :( Anyone know how this would work?
I'm not sure if there is explicit support for zero-length matching symbols (like ^ in Perl-style regular expressions, which matches a position rather than a substring). However, you should be able to let your lexer turn newlines into an explicit token, something like this:
parser.mly
%token EOL
%token <int> EOLWS
% other stuff here
%%
main:
EOL stmt { MyStmtDataType(0, $2) }
| EOLWS stmt { MyStmtDataType($1 - 1, $2) }
;
lexer.mll
{
open Parser
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip other blanks *)
| ['\n'][' ']+ as lxm { EOLWS(String.length(lxm)) }
| ['\n'] { EOL }
(* ... *)
This is untested, but the general idea is:
Treat newlines as staetment 'starters'
Measure whitespace that immediately follows the newline and pass its length as an int
Caveat: you will need to preprocess your input to start with a single \n if it doesn't contain one.

I want to replace ',' on the 150th location in a String with a <br>

My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' ,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need to have a regex which will greedily looks for up to 150 characters with a last character being a ','. And then replace the last ',' of the 150 with a <br />
Any suggestions pls?
I used this ','(?=[^()]*\)) but this one replaces all the occurences. I want the 150th ones to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first encounter of ',' in the 150 chars.
Can anyone help modify my code to incorporate the break at the last occurence of ',' withing the 150 chars.
You'll want something like this:
Look for every occurrence of \([^)]+*,[^)]+*\) (Find a parenthesis-wrapped string with a comma in it and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into succesive 150 or less
you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is a regex of type global (/g) and include newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )