How to match thing that is not in capture buffer in perl - regex

Here is the sample script with my problem:
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
my $string = "aaabc";
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? #
([^\1]) # This is incorrect. It does not work as I need
# Here I need to match the thing that is not \1
# (in this case I need to match 'b')
/x;
if ($string =~ $re) {
say $1;
say $2;
} else {
say 'no match';
}

you need a Negative Lookahead. This will find the pattern and start the rest of the search from there. Meaning the next capture will be the one you seek.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? # also (.)+? works as first expression.
(?!\1) # Negative Lookahead is non-capturing
(.) # $2 is b
/x;

As suggested by #DeVadder, you could make use of (?>pattern) which is:
an "independent" subexpression, one which matches the substring that a
standalone pattern would match if anchored at the given position, and
it matches nothing other than this substring.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
(?>\1*) # Matches \1
(.)
/x;
This would handle both cases as expected.

The regex searches captures first character and use it as \1*. Finally get a character that might be same as \1 or different if exists and check if $1 and $2 are same. If they are same then there is no character other than $1. If we have a character then we have a match and $1 ne $2.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
while(<DATA>){
my $re = qr/^(.)\1*(.)/x;
if ($_=~$re && $1 ne $2) {
say $1;
say $2;
} else {
say 'no match';
}
}
__DATA__
aaaa
aaabc
abc
baacd
Output :
no match
a
b
a
b
b
a

Related

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a perl regex which converts hyphens to spaces eg:-
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.
You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
RegEx Demo
This will skip words dvi-i and dvi-d for splitting due to use of (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
Perl Code Demo
There is an alternate lookarounds based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
Perl Code Demo 2
$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
While using just (?<!dvi)-(?![id]) you will exclude also dvi-x or x-i, where x can be any character.
It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The #- array contains the starting offsets of matched groups, and the #+ array contains the ends offsets. In both cases, element 0 refers to the whole match.
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe #tchrist can figure out how to bend various assertions to his will.
we can ignore specific words using negative Look-ahead and negative Look-behind
Example :
(?!pattern)
is a negative look-ahead assertion
in your case the pattern is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
output :
use either dvi-d or dvi-i
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

Perl Regex Multiple Matches

I'm looking for a regular expression that will behave as follows:
input: "hello world."
output: he, el, ll, lo, wo, or, rl, ld
my idea was something along the lines of
while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
print "$1-$2 ";
}
But that does something a little bit different.
It's tricky. You have to capture it, save it, and then force a backtrack.
You can do that this way:
use v5.10; # first release with backtracking control verbs
my $string = "hello, world!";
my #saved;
my $pat = qr{
( \pL {2} )
(?{ push #saved, $^N })
(*FAIL)
}x;
#saved = ();
$string =~ $pat;
my $count = #saved;
printf "Found %d matches: %s.\n", $count, join(", " => #saved);
produces this:
Found 8 matches: he, el, ll, lo, wo, or, rl, ld.
If you do not have v5.10, or you have a headache, you can use this:
my $string = "hello, world!";
my #pairs = $string =~ m{
# we can only match at positions where the
# following sneak-ahead assertion is true:
(?= # zero-width look ahead
( # begin stealth capture
\pL {2} # save off two letters
) # end stealth capture
)
# succeed after matching nothing, force reset
}xg;
my $count = #pairs;
printf "Found %d matches: %s.\n", $count, join(", " => #pairs);
That produces the same output as before.
But you might still have a headache.
No need "to force backtracking"!
push #pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;
Though you might want to match any letter rather than the limited set you specified.
push #pairs, "$1$2" while /(\pL)(?=(\pL))/g;
Yet another way to do it. Doesn't use any regexp magic, it does use nested maps but this could easily be translated to for loops if desired.
#!/usr/bin/env perl
use strict;
use warnings;
my $in = "hello world.";
my #words = $in =~ /(\b\pL+\b)/g;
my #out = map {
my #chars = split '';
map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
} #words;
print join ',', #out;
print "\n";
Again, for me this is more readable than a strange regex, YMMV.
I would use captured group in lookahead..
(?=([a-zA-Z]{2}))
------------
|->group 1 captures two English letters
try it here
You can do this by looking for letters and using the pos function to make use of the position of the capture, \G to reference it in another regex, and substr to read a few characters from the string.
use v5.10;
use strict;
use warnings;
my $letter_re = qr/[a-zA-Z]/;
my $string = "hello world.";
while( $string =~ m{ ($letter_re) }gx ) {
# Skip it if the next character isn't a letter
# \G will match where the last m//g left off.
# It's pos() in a regex.
next unless $string =~ /\G $letter_re /x;
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
You can put the "check the next letter" logic inside the original regex with a zero-width positive assertion, (?=pattern). Zero-width meaning it is not captured and does not advance the position of a m//g regex. This is a bit more compact, but zero-width assertions get can get tricky.
while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
UPDATE: I'd originally tried to capture both the match and the look ahead as m{ ($letter_re (?=$letter_re)) }gx but that didn't work. The look ahead is zero-width and slips out of the match. Other's answers showed that if you put a second capture inside the look-ahead then it can collapse to just...
say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
I leave all the answers here for TMTOWTDI, especially if you're not a regex master.

Perl Extracting Text

I have been working on this for so long!
I'd appreciate your help...
What my doc will look like:
<text>
<text> command <+>= "stuff_i_need" <text>
<text>
<text> command <+>= stuff <text>
<text>
<text> command <+>= -stuff <text>
<text>
Anything with tangle brackets around it is optional
stuff could be anything (apple, orange, banana) but it is what I need to extract
the command is fixed
My code so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::Diff;
# File Handlers
open(my $ofh, '>in.txt');
open(my $ifh, '<out.txt');
while (<$ifh>)
{
# Read in a line
my $line = $_;
chomp $line;
# Extract stuff
my $extraction = $line;
if ($line =~ /command \+= /i) {
$extraction =~ s/.*"(.*)".*/$1/;
# Write to file
print $ofh "$extraction\n";
}
}
Based on the example input:
if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) {
$extraction = $1;
print "$extraction\n";
}
A few things:
For extraction, don't use substitution (i.e., use m// and not s///). If you use a match, the parenthetical groups inside the match will be returned as a list (and assigned to $1, $2, $3, etc. if you prefer).
The =~ binds the variable you want to match. So you want $extraction to actually be $line.
Your .* match is too greedy and will prevent the match from succeeding the way you want. What I mean by "greedy" is that .* will match the trailing " in your lines. It will consume the rest of the input on the line and then try match that " and fail because you've reached the end of the line.
You want to specify what the word could be. For example, if it's letters, then match [a-zA-Z]
my ($extraction) = $line =~ /command \+= "([a-zA-Z]*)"/;
If it's a number, you want [0-9]:
my ($extraction) = $line =~ /command \+= "([0-9]*)"/;
If it could be anything except ", use [^"], which means "anything but "":
my ($extraction) = $line =~ /command \+= "([^"]*)"/;
It usually helps to try to match against just what you are looking for rather than the blanket .*.
The following regular expression would help you:
m{
(?<= = ) # Find an `=`
\s* # Match 0 or more whitespaces
(?: # Do not capture
[ " \- ] # Match either a `"` or a `-`
)? # Match once or never
( # Capture
[^ " \s ]+ # Match anything but a `"` or a whitespace
)
}x;
The following one-liner will extract a word (a sequence of characters without spaces) that follows an equal sign prefixed by an optional plus sign, surrounded by optional quotes. It will read from in.txt and write to out.txt.
perl -lne 'push #a, $1 if /command\s*\+?=\s*("?\S+"?)/ }{
print for #a' in.txt > out.txt
The full code - if you prefer script form - is:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
push #a, $1 if /command\s*\+?=\s*("?\S+"?)/;
}
{
print $_ foreach (#a);
}
Courtesy of the Deparse function of the O module.
A light solution.
#!/usr/bin/env perl
use warnings;
use strict;
open my $ifh, '<','in.txt';
open my $ofh, '>', 'out.txt';
while (<$ifh>)
{
if (/
\s command\s\+?=\s
(?:-|("))? # The word can be preceded by an optional - or "
(\w+)
(?(1)\1)\s+ # If the word is preceded by a " it must be end
# with a "
/x)
{
print $ofh $2."\n";
}
}

Perl Regex to match only 1 domain

I'm trying to create a regex which matches the following:
part1#domain.com
part1: where part1 is any 5 digit number from 0-9
part2: [optional] where #domain.com are all domains except #yahoo.com
example: 12345#yahoo.com
I'm not able to find how to insert a conditional into the regex. Now only my regex match digits + domain. Still need to figure out:
how to match only the digits
conditional to accept all domains except #yahoo.com
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $regex1 = '^(\d{5})([#]([a-zA-Z0-9_-]+?\.[a-zA-Z]{2,6})+?)';
while ( my $line = <DATA> ) {
chomp $line;
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}
Sample data:
1234
12345#
12345#tandberg
A12345#tandberg.com
12345
12345#tandberg.com
12345#cisco.com
12345#tandberg.amer.com
12345#tandberg.demo
why not simply first check for yahoo.com and if you get a match go to the next line:
while ( my $line = <DATA> ) {
chomp $line;
next if ($line =~ /yahoo\.com$/);
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}
How about this?
\d{5}(?:#(?!yahoo)[a-zA-Z0-9.]+\.[a-zA-Z]{2,3})?
In expanded form:
\d{5} # 5 digits
(?: # begin a grouping
# # literal # symbol
(?!yahoo\.com) # don't allow something that matches 'yahoo.com' to match here
[a-zA-Z0-9.]+ # one or more alphanumerics and periods
\. # a literal period
[a-zA-Z]{2,3} # 2-3 letters
) # end grouping
? # make the previous item (the group) optional
(?!yahoo\.com) is what's called a "negative lookahead assertion".