Perl - Split a string containing tuples using regex - regex

I have a string containing tuples like below:
"(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)"
I am trying to split this string to an array containing just the tuples: [(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)]
I cannot use split as shown below, since it also splits inside each tuple. Is there a regex or some other clever way to get the tuples as-is?
@Tuples = split /,/, $myString;

split can be used for this but requires a slightly more detailed expression.
my $str = "(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)";
my @arr1 = split(/(?<=\)),(?=\()/, $str);
The key here is the use of a zero-width look-behind assertion for checking for a closing paren and the use of a zero-width look-ahead assertion (not exactly necessary here but useful to see) to check for an open paren. Check the perlre docs for more info on these.
Alternatively, if you want to avoid split altogether then you can use a global match as well.
my @arr2 = $str =~ /(\([^)]+\))/g;
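A minimal sketch that runs both approaches side by side on the string from the question. The variable names @by_split and @by_match are only illustrative.

```perl
use strict;
use warnings;

my $str = "(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)";

# Split only on commas that sit between ")" and "(".
my @by_split = split /(?<=\)),(?=\()/, $str;

# Or collect every parenthesised group with a global match in list context.
my @by_match = $str =~ /(\([^)]+\))/g;

print "$_\n" for @by_split;
print "both approaches agree\n" if "@by_split" eq "@by_match";
```

Either way the array ends up with three elements, each one a complete tuple including its parentheses.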

If your data is structured consistently the way you showed, you can use a lookbehind to check and see if the comma comes after a parenthesis.
/(?<=\)),/
You could also use a negative lookbehind to see if a number is before the comma, and not split there, though that could be confusing to understand.
/(?<!\d),/

If there are no parentheses inside the tuples, and no parentheses outside the tuples, you could simply use the following regex:
my @array = $str =~ /(\(.*?\))/sg;
assuming that there is always a starting parenthesis and a matching end parenthesis for each tuple.
Here
*? is a nongreedy quantifier, see perlretut for more information,
the flag s is a regex modifier that lets . also match newline characters (in case your string contains newlines), see perlre for more information,
the flag g stands for global matching and allows the matching operator to match within a string as many times as possible.
In list context, g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp, see perlretut for more information.
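To see why the nongreedy quantifier matters, here is a small sketch comparing .*? with the greedy .* on a shortened, made-up input (the string and variable names are illustrative, not from the question).

```perl
use strict;
use warnings;

my $str = "(1,2),(3,4)";

# Nongreedy: each match stops at the first closing parenthesis.
my @lazy = $str =~ /(\(.*?\))/sg;

# Greedy: a single match runs from the first "(" to the last ")".
my @greedy = $str =~ /(\(.*\))/sg;

print scalar(@lazy),   " lazy matches: @lazy\n";
print scalar(@greedy), " greedy match: @greedy\n";
```

The lazy version yields the two tuples separately, while the greedy version returns the whole string as one match.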

Related

How to reconstruct regex matched part

I have to simplify some LaTeX math formulas within text, for example
This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal
I want to transform this into
This is BaFe2As2 crystal
That is, concatenate only the content within the innermost brackets.
I figured out that I can use the regex pattern
\{[^\{\}]*\}
to match those innermost brackets. But the problem is how to concatenate them together?
I don't know if this could be done in Notepad++ regex replacement. If Notepad++ is not capable, I can also accept a Perl one-liner solution.
There may clearly be multiple such equations (the markup between two $s) in the document. So while you need to assemble the text between all {}, this also needs to be constrained to within a $ pair. Then all such equations need to be processed.
Matching that in a single pattern results in a complex regex. Instead, we can first extract everything within a pair of $s and then gather the text within {}s from that, which simplifies the regex a lot. This makes two passes over each equation, but a LaTeX document is small for computational purposes and the loss of efficiency won't be noticeable.
use warnings;
use strict;
use feature 'say';
my $text = q(This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal,)
    . q( and ${\text{Some}}{\mathbf{More}}$ text);
my @results;
while ($text =~ /\$(.*?)\$/g) {
    my $eq = $1;
    push @results, join('', $eq =~ /\{([^{}]+)\}/g);
}
say for @results;
This prints lines BaFe2As2 and SomeMore.
The regex in the while condition captures all chars between two $s. After the body of the loop executes and the condition is checked again, the regex continues searching the string from the position of the previous match. This is due to the "global" modifier /g in scalar context, imposed on regex since it is in the loop condition. Once there are no more matches the loop terminates.
In the body we match between {}, and again due to /g this is done for all {}s in the equation. Here, however, the regex is in the list context (as it is assigned to an array) and then /g makes it return all matches. They are joined into a string, which is added to the array.
In order to replace the processed equation, use this in a substitution instead
$text =~ s{ \$(.*?)\$ }{ join('', $1 =~ /\{([^{}]+)\}/g) }egx;
where the modifier e makes it so that the replacement part is evaluated as Perl code, and the result of that used to replace the matched part. Then in it we can run our regex to match content of all {} and join it into the string, as explained above. I use s{}{} delimiters, and x modifier so to be able to space things in the matching part as well.
Since the whole substitution has the g modifier the regex keeps going through $text, as long as there are equations to match, replacing them with what's evaluated in the replacement part.
I use a hard-coded string (extended) from the question, for an easy demo. In reality you'd read a file into a scalar variable ("slurp" it) and process that.
This relies on the question's premise that text of interest in an equation is cleanly between {}.
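For reference, here is a self-contained sketch of the substitution variant. It moves the inner match into a small helper, equation_text, which is a hypothetical name introduced here only for readability, and it works on a copy ($plain) so the original text is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate the contents of every innermost {...} pair.
sub equation_text {
    my ($eq) = @_;
    return join '', $eq =~ /\{([^{}]+)\}/g;
}

my $text = q(This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal,)
         . q( and ${\text{Some}}{\mathbf{More}}$ text);

# Copy first, then replace each $...$ equation with its flattened text.
(my $plain = $text) =~ s{\$(.*?)\$}{ equation_text($1) }eg;

print "$plain\n";   # This is BaFe2As2 crystal, and SomeMore text
```

Copying $1 into a lexical inside the helper keeps the inner match from interfering with the outer capture.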
Missed the part that a one-liner is sought
perl -0777 -wnE'say join("", $1=~/\{([^{}]+)\}/g) while /\$(.*?)\$/g' file.tex
With -0777 the file is read whole ("slurped"), and as -n provides a loop over input lines it is in the $_ variable; the regex in the while condition works by default on $_. In each iteration of while, the contents of the captured equation, in $1, are directly matched for {}s.
Then to replace each equation and print out the whole processed file
perl -0777 -wne's{\$(.*?)\$}{join "", $1=~/\{([^{}]+)\}/g}eg; print' file.tex
where I've removed extra spaces and (unnecessary) parens on join.
Use this regex in Notepad++. I have tried to match everything which is NOT present between the innermost curly brackets and then replaced the match with a blank string.
[^{}]*\{|\}[^{}]*
Explanation:
[^{}]*\{ - matches 0+ occurrences of any character that is neither { nor } followed by {
| - OR
\}[^{}]* - matches } followed by 0+ occurrences of any character that is neither { nor }
UPDATE:
Try this updated regex:
\$?(?=[^$]*\$[^$]*$)(?:[^{}]*{|}[^{}]*)(?=[^$]*\$[^$]*$)\$?

PCRE: Searching for a string not commented or within a comment block?

I'm doing a (PCRE) search for strings, but i don't want to match any string that is commented or appears in a comment block, so, in this file:
/*
function someFuncInCommentBlock(){
return 'match this string';
}
*/
// var someVarThatsCommented = 'match this string';
var someVar = 'match this string';
function someFunc(){
return 'match this string';
}
... I would only expect to see two matches for match this string (the last two that aren't in comments). What sort of pattern syntax do I need to do this?
You can use this regex:
/\*[\s\S]*?\*/(*SKIP)(*FAIL)|//.*(*SKIP)(*FAIL)|'(.*?)'
The idea of this regex is to match what you don't want and discard it using the backtracking control verbs (*SKIP)(*FAIL). With this technique, commonly called the "discard technique", you chain the patterns you want to exclude, like this:
/\*[\s\S]*?\*/(*SKIP)(*FAIL) <--- Discard block comments
| or
//.*(*SKIP)(*FAIL) <--- Discard single-line comments
| or
'(.*?)' <--- Keep everything within single quotes
With PCRE you can take advantage of (*SKIP)(*FAIL) to exclude everything matching a given pattern.
On the other hand, regex engines that don't support these verbs can achieve the same discard technique with a regex trick that consists of the following OR patterns:
exclude this | another pattern to exclude | (save this content)
For the regex I posted, if you have to achieve the same in other regex engine you could use this regex:
/\*[\s\S]*?\*/|//.*|'(.*?)'
All the patterns to be excluded are on the left, separated by ORs. At the far right you have a capturing group that matches what you want. When one of the excluded alternatives matches, the group stays empty, so you keep only the matches in which the group participated.
As Bark Kiers pointed out in a comment, my regex matches any content within single quotes; it doesn't explicitly match match this string. So, in order to match match this string, you could change the regex to:
/\*[\s\S]*?\*/(*SKIP)(*FAIL)|//.*(*SKIP)(*FAIL)|match this string
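Perl's own regex engine also supports these verbs (since 5.10), so the technique can be tried without PCRE. The sketch below runs the first pattern on the sample file from the question, then the verb-free fallback, keeping only the matches where the capturing group took part. The variable names are illustrative.

```perl
use strict;
use warnings;

my $source = <<'END';
/*
function someFuncInCommentBlock(){
return 'match this string';
}
*/
// var someVarThatsCommented = 'match this string';
var someVar = 'match this string';
function someFunc(){
return 'match this string';
}
END

# With verbs: comments are matched, then skipped and failed.
my $re = qr{/\*[\s\S]*?\*/(*SKIP)(*FAIL)|//.*(*SKIP)(*FAIL)|'(.*?)'};
my @found = $source =~ /$re/g;

# Without verbs: comments match too, but leave the group undefined.
my @portable = grep { defined } $source =~ m{/\*[\s\S]*?\*/|//.*|'(.*?)'}g;

print scalar(@found), " matches with verbs\n";
print scalar(@portable), " matches without verbs\n";
```

Both lists contain the two uncommented strings only.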

Regex detect if a matched comma(,) does not lie in a regex

I am trying to figure out a way to determine if my matched comma(,) does not lie inside a regex. Basically, i do not want to match my character if it lies in a regex.
The regex i have come up with is ,(?<!.+\/)(?!.+\/) but its not quite working.
Any ideas?
I want to skip /some,regex/ but match any other commas.
Edit:
Live example: http://rubular.com/r/WjrwSnmzyP
Here is the regex that will work for you:
,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)
Live Demo: http://rubular.com/r/37buDdg1tW
Explanation: it matches a comma that is followed by an EVEN number of forward slashes (/). Hence a comma (,) between two slash (/) characters will NOT be matched, while commas outside will be matched (since those are followed by an even number of / characters).
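As a rough illustration, the same pattern can be tried in Perl with split. This sketch assumes the slashes are balanced and never escaped; the input string is made up for the example.

```perl
use strict;
use warnings;

my $input = 'a,/b,c/,d';

# Split on commas that are followed by an even number of slashes.
my @parts = split m{,(?!\s)(?=(?:(?:[^/]*/){2})*[^/]*$)}, $input;

print "$_\n" for @parts;
```

The comma inside /b,c/ is followed by one slash, so it is not a split point; the other two commas are followed by two and zero slashes respectively.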
A curious thing about regular expressions is that if you want to use them to ignore "something" that is within "something else", you need to match that "something else", prefer matches of it, and then either silently discard or reproduce those matches.
For example, in order to remove all commas from a string unless they are in a regular expression literal—
In Perl:
my $s = "/foo,bar/,baz";
$s =~ s{(/(?:[^/\\]|\\.)+/)|,}{\1}g;
In ECMAScript:
var s = "/foo,bar/,baz";
s = s.replace(/(\/([^\/\\]|\\.)+\/)|,/g, "$1");
or
s = s.replace(new RegExp("(/([^/\\\\]|\\\\.)+/)|,", "g"), "$1");
Note that I am capturing the match for the regular expression literal in the string value, and reproducing it (\1 or $1) if it matched. (If the other part of the alternation – the standalone comma – matched, the empty string is captured, so this simple approach suffices here.)
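In Perl, the group is actually undefined rather than empty when the comma branch matches, which triggers a warning under use warnings. A possible variant, sketched here with the /e modifier and an explicit defined check, sidesteps that; $out is an illustrative name.

```perl
use strict;
use warnings;

my $s = "/foo,bar/,baz";

# Reproduce a regex literal if one was captured, otherwise drop the comma.
(my $out = $s) =~ s{(/(?:[^/\\]|\\.)+/)|,}{ defined $1 ? $1 : '' }ge;

print "$out\n";   # /foo,bar/baz
```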
For further reading I recommend “Mastering Regular Expressions” by Jeffrey E. F. Friedl. Two rather enlightening example chapters, each from a different edition, are available for free online.

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines =~ />(.*)\s/g) {
    unless ($1 =~ /else/i) {
        $filelines =~ s/($1)/$1\_else/;
    }
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace, but only after ::=:
some  text ::=   a  b c    d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some  text ::= a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
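A runnable sketch of the same idea, assuming Perl 5.14 or later for the r modifier. The input line with its runs of spaces is made up for the demonstration.

```perl
use strict;
use warnings;

my $line = "some  text ::=   a  b c    d e ;";

# Collapse whitespace only in the part captured after "::=".
(my $out = $line) =~ s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e;

print "[$out]\n";   # [some  text ::= a b c d e ;]
```

The double space before ::= survives because it lies outside the captured group.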
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/
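Applied to the example from the question, the two forms could look like the sketch below. The second pattern uses pre\w+ to stand in for a prefix of unknown length, which is an assumption made only for this demonstration.

```perl
use strict;
use warnings;

my $str = 'prefix something suffix';

# Fixed-length prefix: lookbehind plus lookahead.
(my $lookaround = $str) =~ s/(?<=prefix )something(?= suffix)/42/;

# Variable-length prefix: \K drops everything matched so far from the replaced part.
(my $keep = $str) =~ s/pre\w+ \Ksomething(?= suffix)/42/;

print "$lookaround\n";   # prefix 42 suffix
print "$keep\n";         # prefix 42 suffix
```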

What regular expression can remove duplicate items from a string?

Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?
How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.
Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f
Note: I am coding in perl, but I would very much appreciate using a regex for this.
In .NET which supports infinite repetition inside lookbehind, you could search for
(?<=\b\1:.*)\b(\w+):?
and replace all matches with the empty string.
Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):
\b(\w+):(?=.*\b\1:?)
If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of
a:b:c:d:x:e:f
you would get
a:b:d:x:c:e:f
If that is OK, you can use
$subject =~ s/\b(\w+):(?=.*\b\1:?)//g;
Explanation:
First regex:
(?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.
\b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.
Second regex:
\b(\w+):: Match an identifier and a colon.
(?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.
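A quick sketch to confirm the behaviour of the lookahead version on the string from the question; $deduped is an illustrative name.

```perl
use strict;
use warnings;

my $subject = 'a:b:c:d:c:c:x:c:c:e:e:f';

# Remove every identifier that occurs again later, so the last occurrence survives.
(my $deduped = $subject) =~ s/\b(\w+):(?=.*\b\1:?)//g;

print "$deduped\n";   # a:b:d:x:c:e:f
```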
Check out: http://www.regular-expressions.info/duplicatelines.html
Always a useful site when thinking about any regular expression.
use feature 'say';
my $str = q!a:b:c:d:c:c:x:c:c:e:e:f!;
1 while ($str =~ s/(:[^:]+)(.*?)\1/$1$2/g);
say $str;
Output:
a:b:c:d:x:e:f
Here's an awk version; no regex needed.
$ echo "a:b:c:d:c:c:x:c:c:e:e:f" | awk -F":" '{for(i=1;i<=NF;i++)if($i in a){continue}else{a[$i];printf $i}}'
abcdxef
Split the fields on ":", go through the split fields, and store the elements in an array. Check for existence and skip the element if it exists; otherwise print it. You can translate this easily into Perl code.
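One possible Perl translation, sketched with a hash of seen identifiers; unlike the awk output above, it joins the result with ":" again. The names %seen and $unique are illustrative.

```perl
use strict;
use warnings;

my $str = 'a:b:c:d:c:c:x:c:c:e:e:f';

# Keep an identifier only the first time it is seen, preserving order.
my %seen;
my $unique = join ':', grep { !$seen{$_}++ } split /:/, $str;

print "$unique\n";   # a:b:c:d:x:e:f
```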
If the identifiers are sorted, you may be able to do it using lookahead/lookbehind. If they aren't, then this is beyond the computational power of a regex. Now, just because it's impossible with formal regex doesn't mean it's impossible if you use some perl specific regex feature, but if you want to keep your regexes portable you need to describe this string in a language that supports variables.