Why doesn't this perl regex capture the last character? - regex

let $PWD = /Unix_Volume/Users/a/b/c/d
I would expect:
echo $PWD | perl -ne 'if( /(\w+)[^\/]/ ){ print $1; }'
to display "Unix_Volume". However, it displays "Unix_Volum." Why doesn't the regex capture the last character?

(\w+) => Unix_Volum
[^\/] => e (not a /)
/ => /

Try:
export PWD=/Unix_Volume/Users/a/b/c/d
perl -MFile::Spec -e'print((File::Spec->splitdir($ENV{PWD}))[1],"\n")'
You should always use the modules that come with Perl where possible. For a list of them, see perldoc perlmodlib.

Since \w doesn't include a forward slash in its class, why do you need [^\/]?
/(\w+)/ will do. It captures the first occurrence of this class.
edit: /.*\b(\w+)/ to capture the last occurrence.

The (\w+) group matches and captures the word characters "Unix_Volume" greedily, leaving the position at the / after "Unix_Volume".
The [^\/] class forces the engine to back up (the greedy + quantifier gives up characters it's matched to satisfy atoms that follow it) to match a character that is not "/", matching the "e" at the end of "Unix_Volume". Since the matched "e" is outside the capturing group you're left with "Unix_Volum" in $1.
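The backtracking can be seen directly by printing both the whole match and the captured group. This is a minimal sketch using the path from the question; the variable names are only for illustration, and the anchored pattern at the end is one possible fix rather than the only one.

```perl
use strict;
use warnings;

my $path = '/Unix_Volume/Users/a/b/c/d';

# Original pattern: \w+ has to give back one character so [^\/] can match.
my ($whole, $group);
if ($path =~ /(\w+)[^\/]/) {
    $whole = substr($path, $-[0], $+[0] - $-[0]);   # text of the whole match
    $group = $1;                                    # text of capture group 1
}
print "whole match: $whole\n";
print "group 1:     $group\n";

# Anchoring on the leading slash captures the full first path component.
my ($first) = $path =~ m{^/(\w+)};
print "first component: $first\n";
```

The whole match still covers "Unix_Volume"; only the group is one character short, because the final "e" was consumed by [^\/].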

Related

Non-Capturing and Capturing Groups - The right way

I'm trying to match an array of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using the given regex I could only match the last word, "bird".
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing groups:
(?:...) is the non-capturing version: this text has to be present, but you don't need to keep it.
(...) is the capturing version: you want this text, it's what you're searching for.
So:
(?:pets:) means you're searching for "pets:" but don't want to capture it. After that point, you WANT to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); ... You're searching for "pets:" (but don't want it!) and stopping at the first ";" (which you don't want either).
Result is :
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
Since you want to have each pet in a separate match and you are using PCRE \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative Lookahead (?!^) to assert that the Regex does not match at the start of the string
(\w+) matches one pet name
Non-capturing group (?:[,;]|$) is used as a delimiter: it matches a single character from the list ,; or, with $, asserts position at the end of the string
Perl Code Sample:
use strict;
use warnings;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
    # $1 is undefined when only the "pets:" alternative matched
    if (defined $1) {
        # print "$1\n";
        push @result, $1;
    }
}
print Dumper(\@result);
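If one match per pet is not a hard requirement, a two-step approach avoids \G altogether: capture the whole list once, then split it on commas. This is only a sketch, and it assumes the pet list always ends at a semicolon or at the end of the string.

```perl
use strict;
use warnings;

my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';

my @pets;
# Capture everything after "pets:" up to (not including) the next ";"
if ($str =~ /pets:([^;]*)/) {
    @pets = split /,/, $1;
}
print join(' ', @pets), "\n";
```

The capturing group is used once for the list as a whole, which sidesteps the problem that a repeated group only keeps its last repetition.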

Regular expression Capture and Backreference

Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the letters that follow each number, taking X letters (X being that number). I also want to capture the complete string, including the + and the number.
ie the capture should return:
+4ACCG
+12CAAGTACTACCG
etc.
and :
ACCG
CAAGTACTACCG
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing ?
You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2.
Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also remove either the upper-case or the lower-case letters from your class by using the i modifier to match case-insensitively:
/(\+\d+([ATGCN]+))/gi
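The difference between a quantifier outside and inside a capturing group can be checked on a single fragment of the input. A small sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $seq = '+4ACCG';

# Quantifier outside the group: the group is overwritten on every
# repetition, so only the last letter survives.
my ($last_only) = $seq =~ /\+\d+([ATGCN])+/i;

# Quantifier inside the group: the group spans the whole run of letters.
my ($whole_run) = $seq =~ /\+\d+([ATGCN]+)/i;

print "outside: $last_only\n";
print "inside:  $whole_run\n";
```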
The loop below works because the \G assertion tells the regex engine to start matching immediately after the last match (the digits) in the string.
use feature 'say';
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
while (/(\d+)/g) {
    my $dig = $1;
    # \G anchors this match at the position just after the digits
    /\G([TAGCN]{$dig})/i;
    say $1;
}
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
    my ($bases) = $seq =~ /([^\d]+)/;   # first run of non-digit characters
}
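The \G loop can be extended so that it yields both forms the question asks for: the piece with its "+N" prefix and the bare letters. This is a sketch, not a definitive solution. The array names are hypothetical, and it assumes every "+N" is followed by at least N letters from the class [ATGCN].

```perl
use strict;
use warnings;

my $string = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';

my (@with_prefix, @bases_only);
while ($string =~ /\+(\d+)/g) {
    my $len = $1;
    # \G anchors at pos(), i.e. right after the digits just matched.
    # /c keeps pos() unchanged if this inner match fails.
    if ($string =~ /\G([ATGCN]{$len})/gci) {
        push @with_prefix, "+$len$1";
        push @bases_only,  $1;
    }
}
print "$_\n" for @with_prefix;
print "$_\n" for @bases_only;
```

Interpolating $len into the quantifier is what takes the place of the backreference that is not allowed there.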

regex, search and replace until a certain point

The Problem
I have a file full of lines like
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
I want to search and replace such that I get
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
The . are converted to / up until the first forward slash
The Question
How do I write a regex search and replace to solve my problem?
Attempted solution
I tried using look behind with perl, but variable length look behinds are not implemented
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/(?<=[^\/]*)\./\//g'
Variable length lookbehind not implemented in regex m/(?<=[^/]*)\./ at -e line 1.
Workaround
Variable length look aheads are implemented, so you can use this dirty trick
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | rev | perl -pe 's/\.(?=[^\/]*$)/\//g' | rev
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Is there a more direct solution to this problem?
s/\G([^\/.]*)\./\1\//g
\G is an assertion that matches the point at the end of the previous match. This ensures that each successive match immediately follows the last.
Matches:
\G # start matching where the last match ended
([^\/.]*) # capture until you encounter a "/" or a "."
\. # the dot
Replaces with:
\1 # that interstitial text you captured
\/ # a slash
Usage:
echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/\G([^\/.]*)\./\1\//g'
# yields: convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Alternatively, if you're a purist and don't want to add the captured subpattern back in (avoiding that may be more efficient, but I'm not certain), you could make use of \K to restrict the "real" match solely to the ., then simply replace with a /. \K essentially "forgets" what has been matched up to that point, so the final match ultimately returned is only what comes after the \K.
s/\G[^\/.]*\K\./\//g
Matches:
\G # start matching where the last match ended
[^\/.]* # consume chars until you encounter a "/" or a "."
\K # "forget" what has been consumed so far
\. # the dot
Thus, the entirety of the text matched for replacement is simply ".".
Replaces with:
\/ # a slash
Result is the same.
You can use substr as an lvalue and perform the substitution on it. Or transliteration, like I did below.
$ perl -pe 'substr($_,0,index($_,"/")) =~ tr#.#/#'
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
This finds the first instance of a slash, extracts the part of the string before it, and performs a transliteration on that part.
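One caveat with the substr approach: index returns -1 when the line contains no slash, and substr($_, 0, -1) then means "everything except the last character". The sketch below guards that case. The helper name dots_to_slashes is made up for illustration, and converting every dot in a slash-free line is an assumption about the desired behaviour.

```perl
use strict;
use warnings;

sub dots_to_slashes {
    my ($line) = @_;
    my $end = index($line, '/');
    $end = length $line if $end < 0;      # no slash: treat the whole line as prefix
    substr($line, 0, $end) =~ tr{.}{/};   # transliterate only the prefix, in place
    return $line;
}

my $converted = dots_to_slashes('convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it');
my $no_slash  = dots_to_slashes('a.b.c');
print "$converted\n$no_slash\n";
```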

What do these regular expressions mean?

I'm venturing to read some Perl code and found the following regular expressions:
$str =~ s/(<.+?>)|(&\w+;)/ /gis;
$str =~ /(\w+)/gis
I wonder what these codes represent.
Can anyone help me?
The first one $str =~ s/(<.+?>)|(&\w+;)/ /gis; does a substitution:
$str : the variable to work on
=~ : binding operator: apply the substitution to $str and store the result back in it
s : substitution operator
/ : beginning of the regex
( : beginning of captured group 1
< : <
.+? : one or more of any char NOT greedy
> : >
) : end of capture group 1
| : alternation
( : beginning of captured group 2
& : &
\w+ : one or more word char ie: [a-zA-Z0-9_]
; : ;
) : end of group 2
/ : end of search part
: a space
/ : end of replace part
gis; : global, case insensitive, single-line (. also matches a newline)
This will replace all tags and encoded elements like &amp; or &lt; with a space.
The second one checks that at least one word is left.
One way to help decipher regular expressions is to use the YAPE::Regex::Explain module from CPAN:
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this snippet is named 'rexplain' you would do:
$ ./rexplain 's/(<.+?>)|(&\w+;)/ /gis'
The first strips out every XML/HTML tag and every character entity, replacing each one with a space. The second finds every substring consisting entirely of word characters.
In detail:
The first part of the first expression first matches a <, then any character with the . (newlines included thanks to the /s flag at the end). The + quantifier on its own would match one or more characters up until the last > found in $str, but the ? after it makes it non-greedy, so it only matches up to the first > encountered. The second part matches & followed by word characters until ; is found. Since ; is not a word character, the ? modifier is not needed. The s/ up front means a substitution, and the bit after the second / is what each match is replaced with. The /gis at the end means *g*lobal (replace every match, not just the first), case *i*nsensitive, and *s*ingle line.
The second expression finds the first substring of word characters and puts it in $1. If you call it repeatedly, the /g at the end means that it will keep matching every instance in $str.
The first one takes a string and replaces html tags or html character codes with a space
The second one makes sure there is still a word left when done.
These "codes" are regular expressions. Type this to learn more:
perldoc perlre
The code above replaces some HTML/XML tags and some HTML character entities (of the form &name;) in $str with blanks. But there are better ways to do this using CPAN modules. The code then tries to match and capture into variable $1 the first word in $str.
Ex:
perl -le '$str = "foo<br> bar<another\ntag>baz"; print $str; $str =~ s/(<.+?>)|(&\w+;)/ /gis; $str =~ /(\w+)/gis; print $str; print $1;'
It prints:
foo<br> bar<another
tag>baz
foo  bar baz
foo
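To collect every word rather than just the first, the second match can be evaluated in list context, where /g returns all captures at once. This is a sketch with a made-up input string; it drops the capture groups and the /i flag from the substitution since neither affects the result here.

```perl
use strict;
use warnings;

my $str = "foo<br> bar<another\ntag>baz &amp; qux";

# Work on a copy: replace tags and entities with a space.
# /s lets . match the newline inside the second tag.
(my $text = $str) =~ s/<.+?>|&\w+;/ /gs;

# List context + /g: every run of word characters, in order.
my @words = $text =~ /(\w+)/g;

print join(',', @words), "\n";
```

For real HTML, a parser module from CPAN is a safer choice than a regex, as one of the answers above also suggests.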

Non greedy LookAhead

I have strings like follows:
val:key
I can capture 'val' with /^\w*/.
How can I now get 'key' without the ':' sign?
Thanks
How about this?
/^(\w+):(\w+)$/
Or if you just want to capture everything after the colon:
/:(.+)/
Here's a less clear example using a lookbehind assertion to ensure a colon occurred before the match - the entire match will not include that colon.
/(?<=:).*/
What language are you using? /:(.*)/ doesn't capture the ":" but it does match the ':'
In Perl, if you say:
$text =~ /\:(.*)/;
$capture = $1;
$match = $&;
Then $capture won't have the ":" and $match will. (But try to avoid using $& as it slows down Perl: this was just to illustrate the match).
This will capture the key in group 1 and the value in group 2. It should work correctly even when the value contains a colon (:) character.
^(\w+?):(.*)
/\:(\w*)/
That looks for : and then captures the run of word characters that immediately follows it (it stops at the first non-word character or at the end of the string).
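The approaches from this thread can be compared side by side on the sample string. A minimal sketch; the variable names are only for illustration, and the split variant is an extra option that was not proposed above.

```perl
use strict;
use warnings;

my $text = 'val:key';

# Two capture groups: the colon is matched but not captured.
my ($val, $key) = $text =~ /^(\w+):(\w+)$/;

# Lookbehind: the colon is required but is not part of the match at all.
my ($after) = $text =~ /(?<=:)(.*)/;

# No captures needed: split on the first colon only.
my ($left, $right) = split /:/, $text, 2;

print "$val $key $after $left $right\n";
```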