How to fix mismatched brackets in a TeX string? - regex

For example I have a lot of Tex strings like
|u(x,t)|^2 = \frac{1}{\sqrt{1+(4+t)^2))e^{-\frac{2(x-k_0t)^2}{1+4t^2))
The problem with the above Tex is that the bracket is not matching. \frac{1}{\sqrt{1+(4+t)^2)) should be \frac{1}{\sqrt{1+(4+t)^2}} and {-\frac{2(x-k_0t)^2}{1+4t^2)) should be {-\frac{2(x-k_0t)^2}{1+4t^2}}
wrong: \frac{1}{\sqrt{1+(4+t)^2))
right: \frac{1}{\sqrt{1+(4+t)^2}}
wrong: {-\frac{2(x-k_0t)^2}{1+4t^2))
right: {-\frac{2(x-k_0t)^2}{1+4t^2}}
explanation: The first example is not right, because for last two ) there is no ) matching it, and it should be } to match previous {
I want to know how to automatically correct such mismatched brackets? I have perl installed and I intended to do it with regex, but can't figure out a way.

I don't know if I'm understanding you correctly, but it sounds to me like you need to count brackets and make sure that the number of ( or [ or { is equal to the number of corresponding ) or ] or }.
One possible solution is using a hash for every line of TeX and storing values in it (not sure how the file looks. I assume all lines are like you provided):
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift;
my $line_num = 0;
open FH, "<$file" or die "Error: $!\n";
while(<FH>) {
my %brackets = (
'(' => 0,
'[' => 0,
'{' => 0
);
$line_num++;
my #chars = split //, $_;
### Count brackets.
foreach my $char (#chars) {
if ($char eq '(' or $char eq '[' or $char eq '{') {
$brackets{$char}++;
} elsif ($char eq ')' or $char eq ']' or $char eq '}') {
if ($char eq ')') { $brackets{'('}--; }
if ($char eq ']') { $brackets{'['}--; }
if ($char eq '}') { $brackets{'{'}--; }
} else {
next;
}
}
### Check that all hash values are 0.
foreach my $bracket (keys %brackets) {
if ($brackets{$bracket} != 0) {
print "In line $line_num: '$bracket' missing $brackets{$bracket} closing brackets.\n";
}
}
}
This code will at least tell you where the errors occured and give you a general idea of the nature of these errors. for input such as )ff){gfs[[y[46rw] the output will be:
In line 1: '{' missing 1 closing brackets.
In line 1: '[' missing 2 closing brackets.
In line 1: '(' missing -2 closing brackets.
Instead of printing the brackets (or storing the number of brackets. Probably better to store index of the brackets) you can write simple code to fix this because at this point, you'll have all the information you need.
This is not a simple question if the errors in the file have no pattern. I recommend looking for one before actually trying

There must be some kind of condition, which indicates where the parenthesis are used instead of the braces. I assume that it is either in front of e^ and at the end of the line.
This fixes the first one:
perl -pi~ -e 's/\)\)e\^/}}e^/' file.tex
And this the second:
perl -pi~ -e 's/\)\)$/}}/' file.tex

Related

Dynamic regular expression for Nesting brackets failed due to unknow bugs

rencently I have met a strange bug when use a dynamic regular expressions in perl for Nesting brackets' match. The origin string is " {...test{...}...} ", I want to grep the pair brace begain with test, "test{...}". actually there are probably many pairs of brace before and end this group , I don't really know the deepth of them.
Following is my match scripts: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = #ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0} (?!))
)*
)
}x;
I have add brace counts in nesting_parser.pl
I also write a string generator for debug: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which include 1000 lines brace pairs
Now we test our match scripts:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a great perfect result, matched successfully! But when I gen a 4000 lines test log file and match it again, it seem crashed:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regex expresions, mostly it works well for paired brackets, untill recently I found it crashed when I try to match a text file more than 600,000 lines.
I'm really confused by this problems,
I really hope to solve this problem.
thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
my $start = pos($string);
my $open_brackets = 0;
my $failed;
while (0 < $open_brackets or $start == pos($string)) {
if ($string =~ m/\G[^{}]*(\{|\})/g) {
if ($1 eq '{') {
$open_brackets++;
}
else {
$open_brackets--;
}
}
else {
$failed = 1;
break; # WE FAILED TO MATCH
}
}
if (not $failed and 0 == $open_brackets) {
my $matched = substr($string, $start, pos($string));
}

Using regex to empty contents of an if statement but retaining the outer curly braces only

I would like to use regex with sed or perl or awk on
the following lines of code stored in a file, say, file.txt:
if(url == "https://abcd.com"){
if(p=123){System.out.println("url is https://....(OK?)); }
else{
//do nothing!
}
c=1;
}
and would like to empty the contents of the outer if statement and retain the curly braces so that the output is:-
if(url == "https://abcd.com"){}.
Is it possible?
It depends on how arbitrary the code is within the outer braces. For example, these are possibilities:
if (....) {
c = '{';
// this is a mismatched brace: {
}
To cope with such possibilities you'd probably require something that parses the language in question rather than a regex.
The following perl code would do the trick on your exemple
$/ = undef;
my $data = <>;
if( $data =~ /if\(url == \"https:\/\/abcd.com\"\)\{(.+)\}/ms ){
$inside = $1 ;
$data =~ s/\Q$inside\E//ms ;
}
print $data ;
However it would probably fail if there is more code containing curly braces at the end of you sample as it is looking for the last curly brace and extracting all between the targeted if statement and this last curly brace.

How do I get perl to print n lines following a specific string?

I have a very large file and want to pull out all atom symbols and coordinates for the equilibrium geometry. The desired information is displayed as below:
***** EQUILIBRIUM GEOMETRY LOCATED *****
COORDINATES OF ALL ATOMS ARE (ANGS)
ATOM CHARGE X Y Z
-----------------------------------------------------------
C 6.0 0.8438492825 -2.0554543742 0.8601734285
C 6.0 1.7887997955 -1.2651150894 0.4121141006
N 7.0 1.3006136046 0.0934593194 0.2602148346
NOTE: After the coordinates finish there is a blank line.
I have so-far patched together a code that makes sense to me however it produces errors and I am not sure why.
It expects one file after calling on the script, saves each line and changes to $start==1 when it sees the string containing EQUILIBRIUM GEOMETRY which triggers recording of symbols and coordinates. It continues to save lines that contain the coordinate format until it sees a blank line where it finishes recording into $geom.
#!/usr/bin/perl
$num_args = $#ARGV + 1;
if ($num_args != 1) {
   print "\nMust supply GAMESS .log file.\n";
   exit;
}
$file = $ARGV[0];
open FILE, "<", $file;
$start = 0;
$geom="";
while (<FILE>) {
 $line = $_;
if ( $line eq "\n" && ($start == 1) ) {
   $start = 0; }
 if ( $start == 1 && $line =~ m/\s+[A-Z]+\s+[0-9\.]+\s+[0-9\.\-]+\s+[0-9\.\-]+\s+[0-9\.\-]+/ ) {
$line =~ s/^\s+//;
#coordinates = split(/\s+/,$line);
$geom=$coordinates[0],$coordinates[3],$coordinates[4],$coordinates[5];
 }
 if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
   $geom = "";
   $start = 1;
 }
}
print $geom;
Error Message:
Unrecognized character \xC2; marked by <-- HERE after <-- HERE near column 1 at ./perl-grep line 5.
There is an invisible character on line 13
i have created a file with only this line (by cut/paste)
and then add one line above which is my retyping
$geom="";
$geom="";
that looks the same but it is not (the second line is the buggy one)
[tmp]=> cat x | perl -ne '$LINE = $_; $HEX = unpack "H*"; print "$HEX $LINE" '
2467656f6d3d22223b0a $geom="";
2467656f6d3d22223be280a80a $geom="";
you can see there are some more character when you hexamine the file.
So => just remove completely this single line and retype
By the way, there is another issue in your file, you miss to close the regexp '/'
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
but I guess, there are still work to do to finish your script cause i don't see too much the purpose ;)
I copied this over to my Linux box and got the same problem.
Basically, the script is saying that there's an unreadable character at the line:
$geom="";
I re-typed that line in gedit and it ran file.
Also, there's an unclosed regex at the bottom of your script. I add a "/" to the line that reads:
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
OK, first and foremost - strict and warnings are really the first port of call when you're writing a scirpt and having problems. Actually, even before you have problems - switch them on, and adhere to them*.
So as with your code:
$geom="";? - trailing question mark. Should be removed.
$num_args = $#ARGV + 1; - redundant. scalar #ARGV has the same result.
open FILE, "<", $file; - 3 arg open is good. Not using lexical filehandles or checking success is bad.
$line = $_; - redundant. Just use while ( my $line = <FH> ) { instead.
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) - broken regex, no trailing /.
$geom=$coordinates[0],$coordinates[3],$coordinates[4],$coordinates[5]; - doesn't do what you think. You probably want to join these or concatenate them.
$line eq "\n" - picks up blank lines, but might be better if you chomp; first and eq ''.
$start looks like you're trying to do the same thing as the range operator. http://perldoc.perl.org/perlop.html#Range-Operators
you overwrite $geom as you go. Is that your intention?
$line =~ s/^\s+//; - is redundant given all you do is split. split ' ' does the same thing.
it's good form to close your filehandle after using it. Especially when it's not lexically scoped.
So with that in mind, your code might look bit like this:
#!/usr/bin/perl
use strict;
use warnings;
if ( #ARGV != 1 and not -f $ARGV[0] ) {
print "\nMust supply GAMESS .log file.\n";
exit;
}
open( my $input_fh, "<", $ARGV[0] ) or die $!;
my $geom = "";
while ( my $line = <$input_fh> ) {
chomp $line;
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+/ .. m/^$/ ) {
if ( $line
=~ m/\s+[A-Z]+\s+[0-9\.]+\s+[0-9\.\-]+\s+[0-9\.\-]+\s+[0-9\.\-]+/
)
{
my #coordinates = split( ' ', $line );
$geom = join( "",
$coordinates[0], $coordinates[3],
$coordinates[4], $coordinates[5] );
}
}
}
close($input_fh);
print $geom;
(If you have some sample input, I'll verify it).
* There are occasions you might want to switch them off. If you know what these are and why, then you switch them off. Otherwise just assume they're mandatory.

Values from IF statement regex match (Perl)

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
# $val's used here
}
}
I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?
A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore
if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
...
}
However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:
if( my ( $val2, $val2, $val3 ) = split ' ', $line ) {
...
}
Secondly, why split $file by linefeeds; if you are reading the contents of a file, far nicer is to actually read a single line at once:
while( my $line = <$fh> ) {
...
}
I assume that this is not your actual code, because if so, it will not work:
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
}
# all the $valX variables are now out of scope
}
You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:
"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"
Even despite the fact that \S+ is greedy, the anchors ^ ... $ will force the regex to fit, hence allowing the empty strings to split the words.
If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.
open my $fh, "<", "file.txt" or die $!;
my #stored;
while (<$fh>) {
my #vals = split;
push(#stored, \#vals) if #vals; # ignore empty values
}
This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.
Looks like you are just using a table with a space delimiter.You can use the split function:
#valuearray = split(" ", $line)
And then address the elements as:
#valuearray[0] ,#valuearray[1] etc..

How can I replace all the text before the match in a Perl substitution?

I am reading each line of an input file (IN) and printing the line read to an output file (OUT) if the line begins with one of the patterns, say "ab", "cd","ef","gh","ij" etc. The line printed is of form "pattern: 100" or form "pattern: 100:200". I need to replace "pattern" with "myPattern", i.e. print the current line to FILE but replace all the text before the first occurence of ":" with "myPattern". What is the best way to do this?
Currently I have:
while ( <IN> )
{
print FILE if /^ab:|^bc:|^ef:|^gh:/;
}
I am not sure if substr replacement would help as "pattern" can be either "ab" or"cd" or "ef" or "gh" etc.
Thanks!
Bi
Generically, do this like:
my %subst = ( 'ab' => 'newab', 'bc' => 'newbc', 'xy' => 'newxy' );
my $regex = join( '|', map quotemeta, sort { length($b) <=> length($a) } keys %subst );
$regex = qr/^($regex):/;
while ( <IN> ) {
print FILE if s/$regex/$subst{$1}:/;
}
The sort puts the longest ones first, so that if the data has ab:: and both ab and ab: are being substituted, ab: is used instead of ab.
Perl's substitution operator by default (a) uses the first match, (b) only replaces one match and (c) returns true if a replacement was made and false if it wasn't.
So:
while ( <IN> )
{
if (s/<pattern1>:/<replace1>/ ||
s/<pattern2>:/<replace2>/) {
print FILE;
}
}
Should work for you. Note that because of short-circuiting, only one substitution will be made.
sub replacer {
$line = shift;
$find = shift;
$replace = shift;
$line =~ /([^:]+):/
if ($1 =~ /$find/) {
$line =~ s/([^:]+):/$replace/ ;
return $line;
}
return ;
}
while (<IN>)
{
print OUT replacer ($_,"mean","variance");
print OUT replacer ($_,"pattern","newPattern");
}
My perl is a little rusty, so syntax might not be exact.
edit: Put it in a function for ya.
while ( <IN> )
{
s/^pattern:/myPattern:/;
print OUT
}
This might be what you want:
$expr = "^(ab)|(cd)|(ef)|(gh)|(ij)";
while (<IN>)
{
if (/$expr:/)
{
s/$expr/$myPattern/;
print FILE;
}
}
The shortest way to do what you ask above is to re-use your code, but include a substitution.
while ( <IN> )
{
print FILE if s/^(ab|bc|ef|gh):/MyPattern:/;
}
Any of the left hand side patterns will be replaced. If the left hand side does not match, nothing will be printed.