Perl Regex to match only 1 domain - regex

I'm trying to create a regex which matches the following:
part1#domain.com
part1: where part1 is any 5 digit number from 0-9
part2: [optional] where #domain.com are all domains except #yahoo.com
example: 12345#yahoo.com
I'm not able to find how to insert a conditional into the regex. Now only my regex match digits + domain. Still need to figure out:
how to match only the digits
conditional to accept all domains except #yahoo.com
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $regex1 = '^(\d{5})([#]([a-zA-Z0-9_-]+?\.[a-zA-Z]{2,6})+?)';
while ( my $line = <DATA> ) {
chomp $line;
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}
Sample data:
1234
12345#
12345#tandberg
A12345#tandberg.com
12345
12345#tandberg.com
12345#cisco.com
12345#tandberg.amer.com
12345#tandberg.demo

why not simply first check for yahoo.com and if you get a match go to the next line:
while ( my $line = <DATA> ) {
chomp $line;
next if ($line =~ /yahoo\.com$/);
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}

How about this?
\d{5}(?:#(?!yahoo)[a-zA-Z0-9.]+\.[a-zA-Z]{2,3})?
In expanded form:
\d{5} # 5 digits
(?: # begin a grouping
# # literal # symbol
(?!yahoo\.com) # don't allow something that matches 'yahoo.com' to match here
[a-zA-Z0-9.]+ # one or more alphanumerics and periods
\. # a literal period
[a-zA-Z]{2,3} # 2-3 letters
) # end grouping
? # make the previous item (the group) optional
(?!yahoo\.com) is what's called a "negative lookahead assertion".

Related

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

How to match thing that is not in capture buffer in perl

Here is the sample script with my problem:
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
my $string = "aaabc";
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? #
([^\1]) # This is incorrect. It does not work as I need
# Here I need to match the thing that is not \1
# (in this case I need to match 'b')
/x;
if ($string =~ $re) {
say $1;
say $2;
} else {
say 'no match';
}
you need a Negative Lookahead. This will find the pattern and start the rest of the search from there. Meaning the next capture will be the one you seek.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? # also (.)+? works as first expression.
(?!\1) # Negative Lookahead is non-capturing
(.) # $2 is b
/x;
As suggested by #DeVadder, you could make use of (?>pattern) which is:
an "independent" subexpression, one which matches the substring that a
standalone pattern would match if anchored at the given position, and
it matches nothing other than this substring.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
(?>\1*) # Matches \1
(.)
/x;
This would handle both cases as expected.
The regex searches captures first character and use it as \1*. Finally get a character that might be same as \1 or different if exists and check if $1 and $2 are same. If they are same then there is no character other than $1. If we have a character then we have a match and $1 ne $2.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
while(<DATA>){
my $re = qr/^(.)\1*(.)/x;
if ($_=~$re && $1 ne $2) {
say $1;
say $2;
} else {
say 'no match';
}
}
__DATA__
aaaa
aaabc
abc
baacd
Output :
no match
a
b
a
b
b
a

Perl Extracting Text

I have been working on this for so long!
I'd appreciate your help...
What my doc will look like:
<text>
<text> command <+>= "stuff_i_need" <text>
<text>
<text> command <+>= stuff <text>
<text>
<text> command <+>= -stuff <text>
<text>
Anything with tangle brackets around it is optional
stuff could be anything (apple, orange, banana) but it is what I need to extract
the command is fixed
My code so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::Diff;
# File Handlers
open(my $ofh, '>in.txt');
open(my $ifh, '<out.txt');
while (<$ifh>)
{
# Read in a line
my $line = $_;
chomp $line;
# Extract stuff
my $extraction = $line;
if ($line =~ /command \+= /i) {
$extraction =~ s/.*"(.*)".*/$1/;
# Write to file
print $ofh "$extraction\n";
}
}
Based on the example input:
if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) {
$extraction = $1;
print "$extraction\n";
}
A few things:
For extraction, don't use substitution (i.e., use m// and not s///). If you use a match, the parenthetical groups inside the match will be returned as a list (and assigned to $1, $2, $3, etc. if you prefer).
The =~ binds the variable you want to match. So you want $extraction to actually be $line.
Your .* match is too greedy and will prevent the match from succeeding the way you want. What I mean by "greedy" is that .* will match the trailing " in your lines. It will consume the rest of the input on the line and then try match that " and fail because you've reached the end of the line.
You want to specify what the word could be. For example, if it's letters, then match [a-zA-Z]
my ($extraction) = $line =~ /command \+= "([a-zA-Z]*)"/;
If it's a number, you want [0-9]:
my ($extraction) = $line =~ /command \+= "([0-9]*)"/;
If it could be anything except ", use [^"], which means "anything but "":
my ($extraction) = $line =~ /command \+= "([^"]*)"/;
It usually helps to try to match against just what you are looking for rather than the blanket .*.
The following regular expression would help you:
m{
(?<= = ) # Find an `=`
\s* # Match 0 or more whitespaces
(?: # Do not capture
[ " \- ] # Match either a `"` or a `-`
)? # Match once or never
( # Capture
[^ " \s ]+ # Match anything but a `"` or a whitespace
)
}x;
The following one-liner will extract a word (a sequence of characters without spaces) that follows an equal sign prefixed by an optional plus sign, surrounded by optional quotes. It will read from in.txt and write to out.txt.
perl -lne 'push #a, $1 if /command\s*\+?=\s*("?\S+"?)/ }{
print for #a' in.txt > out.txt
The full code - if you prefer script form - is:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
push #a, $1 if /command\s*\+?=\s*("?\S+"?)/;
}
{
print $_ foreach (#a);
}
Courtesy of the Deparse function of the O module.
A light solution.
#!/usr/bin/env perl
use warnings;
use strict;
open my $ifh, '<','in.txt';
open my $ofh, '>', 'out.txt';
while (<$ifh>)
{
if (/
\s command\s\+?=\s
(?:-|("))? # The word can be preceded by an optional - or "
(\w+)
(?(1)\1)\s+ # If the word is preceded by a " it must be end
# with a "
/x)
{
print $ofh $2."\n";
}
}

perl regular expressions replacement

I haven't been able to figure out how to deal with a specific regex problem.
Say I have the a big string that consists of lots of phrases in square brackets. A phrase label (eg S or VP), a token (eg w or wSf), a slash next to that token and then the token's description, (eg CC or VBD_MS3).
So here's an example string:
[S w#/CC] [VP mSf/VBD_MS3]
I want to delete the whole first bracketed phrase and put the w inside of it with the second phrase, like this:
[VP wmSf/VBD_MS3]
Is that even possible using regular expressions?
Edit:
Okay the pattern is:
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
(the second bracketed phrase could have one to any number of / pairs)
where can be any sequence of capital letters that might include an underscore, and word can a sequence of anything that's not whitespace (ie digits/characters/special characters).
Without knowing the actual form or positions, one of these forms might work (untested):
s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S/VP) (\w+)#/\w+\] (\[(?:S/VP) )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S/VP)\s+(\w+)#/\w+\]\s+(\[(?:S/VP)\s+)(\w+/\w+\])}{$2$1$3}g
Edit
Since your edit has included this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it makes it easier to come up with a regex that should work.
Good luck!
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my $regex = qr{
\[\s* #= Start of token phrase '['
(?&label) \s+ # <label> then whitespace's
((?&word)) # Capture $1 - token word, end grp $1
[#]/(?&label) # '#'/<label>
\s*
\] #= End of token phrase ']'
\s*
( # Capture grp $2
\[\s* #= Start of normal phrase '['
(?&label) \s+ # <label> then whitespace's
) # End grp $2
( # Capture grp $3
(?&word)/(?&label) # First <word>/<label> pair
(?:
\s+(?&word)/(?&label) # Optional, many <word>/<label> pair's
)*
\s*
\] #= End of normal phrase ']'
) # End grp $3
(?(DEFINE) ## DEFINE's:
(?<label> \w+) # <label> - 1 or more word characters
(?<word> [^\s\[\]]+ ) # <word> - 1 or more NOT whitespace, '[' nor ']'
)
}x;
$data =~ s/$regex/$2$1$3/g;
print $data;
__DATA__
[S w#/CC] [VP mSf/VBD_MS3]
Output:
[VP wmSf/VBD_MS3]
Edit2
"if the label of the character is PP, and if the next phrase's label is NP, then change the next phrase's label to PP as well when joining. eg. input: [PP w#/IN] [NP something/NN] output: [PP wsomething/NN]"
Sure, without adding too many new capture groups, it can be done with a callback.
Actually, there are many ways to do this, including regex conditionals. I think the
simplest method is with a callback, where the logic for all label decisions can be made.
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my $regex = qr{
( \[\s* # 1 - Token phrase label
(?&label)
\s+
)
( # 2 - Token word
(?&word)
)
[#]/(?&label)
\s*
\]
\s*
( \[\s* # 3 - Normal phrase label
(?&label)
\s+
)
# insert token word ($2) here
( # 4 - The rest ..
(?&word)/(?&label)
(?: \s+ (?&word)/(?&label) )*
\s*
\]
)
(?(DEFINE) ## DEFINE's:
(?<label> \w+) # <label> - 1 or more word characters
(?<word> [^\s\[\]]+ ) # <word> - 1 or more NOT whitespace, '[' nor ']'
)
}x;
$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;
sub checkLabel
{
my ($p1, $p2) = #_;
if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
return $1.'PP'.$2;
# To use the formatting of the token label, just 'return $p1;'
}
return $p2;
}
print $data;
__DATA__
[PP w#/CC] [ NP mSf/VBD_MS3]
Yes,
s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;
Now what patterns are you looking for?
You could even do this:
s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;
Rather than create a magic regex to do the whole job, why not separate the line into phrases, operate on them then return them. This then follows the same logic that you just explained.
This then cleaner, more readable (especially if you add comments) and robust. Of course you will need to tailor to your needs: for example you may want to make the / separated portions into key/value pairs (does the order matter? if not make a hashref); perhaps you don't need to split on / if you never need to modify the label; etc.
Edit per comments:
This takes a literal w before a #, stores it, removes the phrase, then tacks the w onto the next phrase. If thats what you need then have at it. Of course I'm sure there are edge cases to look out for, so backup and test first!
#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
#separate phrases, then split phases into whitespace separated pieces
my #phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);
my $holder; # holder for 'w' (not really needed if always 'w')
foreach my $p (#phrases) { # for each phrase
if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
$holder = $1; # keep the 'w' in holder
$p = undef; #empty to mark for cleaning later
next; #move to next phrase
}
if ($holder) { #if the holder is not empty
$p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
$holder = undef; # and then empty the holder
}
}
#remove emptied phrases
#phrases = grep { $_ } #phrases;
#reconstitute the line
print join( ' ', map { '[' . join(' ', #$_) . ']' } #phrases), "\n";
}
__DATA__
[S w#/CC] [VP mSf/VBD_MS3]
Again, it may seem amazing what you can do with one regex, but what happens if your boss comes in and says, "you know, that thing you wrote to do X works great, but now it needs to do Y too". This is why I like to keep nicely separate logic for each logical step.
#/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;

In Perl, how can I get the matched substring from a regex?

My program read other programs source code and colect information about used SQL queries. I have problem with getting substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
In result print prints whole line and not a substring I expect. I tried different approach, but I use Perl seldom and probably make basic concept error. ( position of tablename in line is not fixed. Another problem is multiple occurrence i.e.[... SELECT * FROM AADTTAB, BBDTTAB, ...] ). How can I obtain that substring?
Use grouping with parenthesis and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my #tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(#tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This psuedo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, #rest ) = #_;
my ( $first_capture, $second_capture, #others ) = $feldman =~ /$some_pattern/;
NOTE:: That said, your regex assumes too much about the text to be useful in more than a handful of situations. Not capturing any table name that doesn't have dt as in positions 3 and 4 out of 7? It's good enough for 1) quick-and-dirty, 2) if you're okay with limited applicability.
It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.
Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}
Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;
$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
If you use $& in your code it will slow down all pattern matches.