In Perl, how can I get the matched substring from a regex?

My program reads other programs' source code and collects information about the SQL queries they use. I have a problem with getting a substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
As a result, print outputs the whole line and not the substring I expect. I tried different approaches, but I seldom use Perl and am probably making a basic conceptual error. (The position of the table name in the line is not fixed. Another problem is multiple occurrences, e.g. [... SELECT * FROM AADTTAB, BBDTTAB, ...].) How can I obtain that substring?

Use grouping with parentheses and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my @tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach (@tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
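As an alternative for the "multiple occurrence" case, a match with the /g flag in list context returns the captures from every match on the line. The following is only a sketch: the helper name table_names is made up for illustration, and it assumes table names consist of ASCII letters only (the question's pattern uses \S instead).

```perl
use strict;
use warnings;

# Hypothetical helper: return every 7-letter name of the form xxDTxxx.
# Assumption: names are ASCII letters only and stand as whole words.
sub table_names {
    my ($line) = @_;
    # With /g in list context, the match returns the capture of every match.
    return $line =~ /\b([A-Z]{2}DT[A-Z]{3})\b/gi;
}

my @tables = table_names('SELECT * FROM AADTTAB, BBDTTAB;');
print "$_\n" for @tables;
```

Unlike the FROM-based pattern above, this does not care where in the line the names appear, so it may also pick up matching identifiers that are not table names.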

I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This:
scans $line and captures the text corresponding to the pattern;
returns "all" the captures (here, one) to the "list" on the other side.
This pseudo-list context is how we catch the first item in a list. It works the same way as unpacking the parameters passed to a subroutine.
my ( $first, $second, @rest ) = @_;
my ( $first_capture, $second_capture, @others ) = $feldman =~ /$some_pattern/;
NOTE: That said, your regex assumes too much about the text to be useful in more than a handful of situations. It will not capture any table name that lacks "DT" in positions 3 and 4 of 7. It's good enough if 1) you want something quick and dirty, or 2) you're okay with limited applicability.

It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.

Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}

Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;

$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
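If you use $& anywhere in your code it will slow down all pattern matches in the program (this applies to Perl versions before 5.20; later versions avoid the penalty).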

Related

Applying Filters in Perl using Regex

I'm trying to extract text and numbers from a string using regex in perl. Here is my code:
$line = "finish=100\n";
($var) = $line =~ /[a-z]+/;
($val) = $line =~ /[0-9]+/;
My expected output is that $var = "finish" and $val = 100. However when I run the code $var = 1 and $val = 1.
Any help would be appreciated!!

Use capturing parentheses inside your regular expressions:
$line = "finish=100\n";
($var) = $line =~ /([a-z]+)/;
($val) = $line =~ /([0-9]+)/;
print "$var $val\n";
Refer to perlre

A regex match in list context (where the regex doesn't use the /g flag) returns:
the empty list if it fails;
a list of captured substrings ($1, $2, ...) if it succeeds and the pattern contains capturing groups;
the list (1) if it succeeds and the pattern doesn't capture anything.
Your regexes match, but they don't contain any capturing groups, so that's why you get 1 in $var and $val.
If you add capturing groups (/([a-z]+)/, /([0-9]+)/), you get the matched substrings instead.
Note that it might be easier to just do it all in one match:
my ($var, $val) = $line =~ /^([a-z]+)=([0-9]+)$/;
This way you also validate that the input string has the expected form and isn't just something like "Cat o' 9 tails", which (with your original regexes) would extract $var = "at" and $val = "9".
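The three cases described above can be checked with a short script. This is just an illustrative sketch; the array names are made up for the example.

```perl
use strict;
use warnings;

my $line = "finish=100\n";

# Success without capturing groups: the list (1).
my @no_groups = $line =~ /[a-z]+/;

# Success with capturing groups: the captured substrings.
my @with_groups = $line =~ /^([a-z]+)=([0-9]+)$/;

# Failure (no uppercase letters in $line): the empty list.
my @failed = $line =~ /([A-Z]+)/;

print "no groups:   @no_groups\n";
print "with groups: @with_groups\n";
print "failed:      ", scalar(@failed), " elements\n";
```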

You can also get both values in one array, maybe with this:
$line = "finish=100\n";
@matches = $line =~ /(\w+)\W(\d+)/;
print "$matches[0], $matches[1]";

Catch multiple patterns

I am trying to match a string against multiple patterns and store the captures in an array.
The input can be one of the following:
-fnospacebetween
-f textwithspacebefore
#nospacebetween
# textwithspacebefore
The regex should catch the string after -f or #. Spaces are allowed before the -f and #, and also between -f or # and the string.
I thought about using a regex with | alternation, but I don't know why it's not catching my input when I use the two regexes in a specific order.
The single-case scenario works as expected:
my $text = '#anystring' ;
if( $text =~ /^\s*\#\s*(\S*)/)
{
print "\n $1";
}
my $text = '-fanystring' ;
if( $text =~ /^\s*-f\s*(\S*)/)
{
print "\n $1";
}
But when I try to use the two in one single regex, I get a "Use of uninitialized..." warning:
my $text = '#anystring' ;
if( $text =~ /^\s*-f\s*(\S*)|^\s*\#\s*(\S*)/)
{
print "\n $1";
}
But with this variant, it works correctly:
my $text = '#anystring' ;
if( $text =~ /^\s*\#\s*(\S*)|^\s*-f\s*(\S*)/)
{
print "\n1: $1";
}
Why does it match correctly when the order is switched?

Why does it match correctly when the order is switched?
This regex
/^\s*\#\s*(\S*)|^\s*-f\s*(\S*)/
will capture into either $1 or $2 depending on which alternative matched. But you are only ever printing $1, which is undef if it was the second alternative that matched.
I suggest you use this instead, which has only one capture and uses an alternation on only the part of the pattern that is variable:
/^\s*(?:\#|-f)\s*(\S*)/
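As a quick check, the single-capture pattern can be run against the four input forms from the question. This is a sketch only; the helper name flag_arg is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text following -f or #, or undef if the
# input does not start with either marker.
sub flag_arg {
    my ($text) = @_;
    my ($arg) = $text =~ /^\s*(?:\#|-f)\s*(\S*)/;
    return $arg;
}

for my $text ('-fnospacebetween', '-f textwithspacebefore',
              '#nospacebetween', '# textwithspacebefore') {
    printf "%-25s => %s\n", $text, flag_arg($text) // '(no match)';
}
```

Because there is only one capturing group, the result is always in $1 (here returned through list assignment), whichever alternative matched.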

Another potential problem with your regex is that it will also match
-f -fanother-flag
-# -#another-flag
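That is, \S* will match any following flag if there is no argument given to the first flag. Better to use \s*((?:[^-\s]\S*)?) if the argument is optional, or \s*([^-\s]\S*) if it is mandatory (the class excludes whitespace as well as the hyphen, since otherwise \s* could backtrack and let [^-] match the space before the next flag). This still assumes the flag argument cannot begin with a hyphen.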

Using regex to extract a matching pattern from a string and assign it to a variable using perl

I am seeking advice on extracting a section of a string that always occurs as the first instance of data between parentheses, using Perl and a regex, and assigning that value to a variable.
Here is the precise situation, I am using perl and regex to extract the courseID from a university catalog and assign it to a variable. Please consider the following:
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The typical format is [course-section-name] [(courseID)] [courseName]
My goal is to create a script which can take each entry, one at a time, assign it to a variable and then use regex to extract only the courseID and assign only the courseID to a variable.
My approach has been to use search and replace to replace everything not matching that with '' and then save what is left (the courseID) to the variable. Here are a few examples of what I have tried:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\d\d\d\d\d]//g;
print $courseID;
Result: 21090112345 --- printing the course-section-name and courseID
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\b\(\d{5}\)]\b//g;
print $courseID;
Result: 210901(12345) --- printing course-section-name, parens, and courseID
So I haven't had much luck with search and replace - however I found this nugget:
\(([^\)]+)\)
On http://regexr.com/ that will match the parens section. However, it would also match multiple parens, including for example (abc).
I'm not really sure at this point how to do something like this:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 12345
OR, better:
$string = IDS-3250-01 (98765) History of US (1860-2000)
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 98765
Any advice or direction would be greatly appreciated. I have tried everything I know and can research in regards to regex to solve this problem. If there is any more information I can include, please ask away.
UPDATE
use warnings 'all';
use strict;
use feature 'say';
my $file = './data/enrollment.csv'; #File this script generates
my $course = ""; #Complete course string [name-of-course] [(courseID)] [course_name]
my @arrayCourses = ""; #Array of courseIDs
my $i = ""; #i in for loop
my $courseID = ""; #Extracted course ID
my $userName = ""; #Username of person we are enrolling
my $action = "add,"; #What we are doing to user
my $permission = "teacher,"; #What permissions to assign to user
my $stringToPrint = ""; #Concatenated string to write to file
my $n = "\n"; #\n
my $c = ","; #,
#BEGIN PROGRAM
print "Enter the username \n";
chomp($userName = <STDIN>); #Get the enrollee username from user
print "\n";
print "Enter course name and press enter. Enter 'x' to end. \n"; #prompt for course names
while ($course ne 'x') {
chomp($course = <STDIN>);
if ($course ne "x") {
if (($courseID) = ($course =~ /[^(]+\(([^)]+)\)/) ) { #nasty regex to extract courseID - thnx PerlDuck and zdim
push @arrayCourses, $courseID; #put the courseID into array
}
else {
print "Cannot process last entry check it";
}
}
else {
last;
}
}
shift @arrayCourses; #Remove first entry from array - add,teacher,,username
open(my $fh,'>', $file); #open file
for $i (@arrayCourses) #write array to file
{
$stringToPrint= join "", $action, $permission, $i, $c, $userName, $n ;
print $fh $stringToPrint;
}
close $fh;
That'll do it! Suggestions or improvements are always welcome! Thanks @PerlDuck and @zdim

#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ /[^(]+\(([^)]+)\)/) ) {
print "course-ID = $courseID; -- line was $line";
}
}
__DATA__
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
Output:
course-ID = 12345; -- line was BIO-2109-01 (12345) Introduction to Biology
course-ID = 54321; -- line was CHM-3501-F2-01 (54321) Introduction to Chemistry
course-ID = 98765; -- line was IDS-3250-01 (98765) History of US (1860-2000)
course-ID = 45678; -- line was SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The pattern I used, /[^(]+\(([^)]+)\)/, can also be written as
/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '('. You must escape that because you don't want
# to start a capture group.
([^)]+) # 1 or more chars that are not a ')'.
# The surrounding '(' and ')' capture this match
\) # a literal ')'
/x
The /x modifier allows you to insert spaces, comments, and even newlines right in the pattern.
Just in case you're unsure about the /x. You can indeed write:
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ / [^(]+ # …
\( # …
([^)]+) # …
\) # …
/x ) ) {
print "course-ID = $courseID; -- line was $line";
}
}
That's probably not nice to read but you can also store the regex in a separate variable:
my $pattern =
qr/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '(' (you must escape it)
([^)]+) # 1 or more chars that are not a ')'.
# The surrounding '(' and ')' capture this match
\) # a literal ')'
/x;
And then:
if (my ($courseID) = ($line =~ $pattern)) {
…
}

Since you nailed down the format
my ($section, $id, $name) =
$string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;
The key here is the negated character class, [^...], which matches any one character other than those listed inside after the ^ (which makes it "negated"). Unescaped parentheses capture the match, except inside a character class [], where they are taken literally.
It first matches all consecutive characters other than (, so up to the first (, and this is captured by the pair of ( ) around it. Then it matches everything other than ), so up to the first closing paren, also captured by its own pair ( ). This comes between literal parentheses \( ... \), which are outside of ( ) since we don't want them captured. Then all the rest is captured by (.+), requiring at least some characters since + means one or more. Note though that these can be spaces. We exclude possible leading white space from the first capture by matching it specifically before the capturing parentheses, and we also strip (some of the) possible spaces around the id parentheses.
The /x modifier allows use of spaces (and comments and newlines) inside the pattern, which helps readability. The match operator returns a list of all captures, which we assign to variables. Note that even if there is only one capture it is still returned as a list. See Regular Expressions Tutorial (perlretut).
Then, assuming that you have the catalog in a file
use warnings 'all';
use strict;
use feature 'say';
my $file = 'catalog.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
next if $line =~ /^\s*$/; # skip empty lines
# Strip leading and trailing white space
$line =~ s{^\s*|\s*$}{}g;
my ($section, $id, $name) =
$line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
or do {
warn "Error with expected format -- ";
next;
};
say "$section, $id, $name";
}
close $fh;
I use s{}{} delimiters since s/// confuse markup's syntax highlighter with this pattern, which is also a good demonstration since these sometimes help readability a lot.
You would store the retrieved variables in a suitable data structure. Any combination of arrays and hashes (and their references) comes to mind, depending on what need be done with them later. See Cookbook of Data Structures (perldsc).
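One possible shape for such a structure is a hash of hashes keyed by course ID. The sketch below is only an illustration: the helper name parse_catalog and the field names are made up, and the pattern is a slightly stricter variant that assumes the section name contains no whitespace (true of the sample data, but an assumption nonetheless).

```perl
use strict;
use warnings;

# Hypothetical helper: parse catalog lines into a hash keyed by course ID.
# Assumption: the section name contains no whitespace.
sub parse_catalog {
    my @lines = @_;
    my %catalog;
    for my $line (@lines) {
        my ($section, $id, $name) =
            $line =~ /^\s* (\S+) \s* \( \s* ([^)]+?) \s* \) \s* (.*\S) \s* $/x
            or next;    # skip lines that do not fit the format
        $catalog{$id} = { section => $section, name => $name };
    }
    return \%catalog;
}

my $catalog = parse_catalog(
    'BIO-2109-01 (12345) Introduction to Biology',
    'IDS-3250-01 (98765) History of US (1860-2000)',
);

for my $id (sort keys %$catalog) {
    print "$id: $catalog->{$id}{section} / $catalog->{$id}{name}\n";
}
```

Keying on the course ID makes later lookups direct, at the cost of silently overwriting an earlier entry if the same ID appears twice.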
Note on the error handling. Since none of the matches involve * (allowing zero matches -- nothing), if any component of your format isn't as expected there won't be a match at all and we get an error. The .+ is extremely permissive but it still requires something to be there. This is why the trailing space is first stripped, so that the last pattern (.+) cannot be satisfied by spaces alone.
If the only objective is the course id and we are certain that the first parentheses are around it
my ($id) = $line =~ / \(\s* ([^)]+) \) /x or do { ... };
We now only need to match and capture the middle piece, something inside parenthesis.
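A small sketch of that shorter form follows. The helper name course_id is made up, and the pattern is narrowed to digits on the assumption that course IDs are purely numeric (as in the sample data); with that assumption a parenthesised word such as (abc) is skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first parenthesised run of digits,
# or undef if there is none. Assumes course IDs are numeric.
sub course_id {
    my ($line) = @_;
    my ($id) = $line =~ / \( \s* (\d+) \s* \) /x;
    return $id;
}

for my $line ('BIO-2109-01 (12345) Introduction to Biology',
              'IDS-3250-01 (98765) History of US (1860-2000)') {
    print course_id($line), "\n";
}
```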

Perl how do you assign a variable to a regex match result

How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my @Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push @Qmail, $msg->{Body};
}
}
for(@Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.

One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";

Use the Perl variables below to meet your requirements:
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = The string matched by the last pattern match.
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi

The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $& variable (also available as $MATCH under use English). This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called @matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my @matches = ();
foreach (@Qmail) {
next unless /$regex|^\s*description/i;
push @matches, $&; # or $MATCH under "use English"
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regex to the string ^\s*owner #.
Assign $sentence to the value of running a match within $regex with the regular expression /^s*owner $/ (which won't match; if it did, $sentence would be 1, but since it didn't, it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.

I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge

You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see, $hello contains "hello"; $world is undef because the second alternative did not take part in the match.

If you have an older perl on your system like me, perl 5.18 or earlier, and you use $` $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.
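A minimal sketch of the /p approach, reusing the 'abcdefghi' example from the earlier answer (the variable names $pre, $match and $post are made up for the example; /p is available from Perl 5.10):

```perl
use strict;
use warnings;

my $str = 'abcdefghi';
my ($pre, $match, $post);

# /p asks Perl to preserve the matched string for the ${^...} variables.
if ($str =~ /def/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "$pre:$match:$post\n";    # prints abc:def:ghi
```

Copying the values into ordinary variables inside the if block keeps them available after later matches overwrite the special variables.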

perl regular expression finding pattern only in the front of the text

Suppose there is a text like this:
|-SAMPLE-D2
|---SAMPLE-D1
|---SAMPLE3
I want to count the number of "-" after |.
I tried to parse that by using the following regular expression in perl
$count=()= /-/g;
but this is problematic because the first two has "-" somewhere else in the text as well as in the front. How should I form my regex or use other function in perl to get the number of "-" right after "|"?

Regex to match the dashes after the starting |:
/^\|([\-]*)/

To count dashes that are not preceded by a word character, use a negative look-behind assertion.
$count = () = /(?<!\w)-/g;

If the vertical line only ever comes at the start you can get the string of repeating minuses with:
my ($match) = $txt =~ /^\|(-*)/;
The parentheses around $match put the match in list context, so the captured portion of the regex is assigned to it.
Then get the number of minuses using
my $minus_count = length($match || '');
The || '' bit supplies an empty string if the regex above found no match at all, to stop length moaning about uninitialised variables (if you have warnings on).
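Put together, and run against the three sample lines from the question, this could look like the sketch below. The helper name leading_dashes is made up, and // (defined-or, Perl 5.10+) is used in place of || as a small variation.

```perl
use strict;
use warnings;

# Hypothetical helper: count the dashes that directly follow a leading '|'.
sub leading_dashes {
    my ($txt) = @_;
    my ($dashes) = $txt =~ /^\|(-*)/;
    return length($dashes // '');    # 0 when the line does not start with '|'
}

for my $txt ('|-SAMPLE-D2', '|---SAMPLE-D1', '|---SAMPLE3') {
    printf "%-15s %d\n", $txt, leading_dashes($txt);
}
```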

Not sure if you can count in Regex directly but you can extract capture groups and do a simple arithmetic with their string lengths:
#!/usr/bin/perl
use warnings;
my $inFile = $ARGV[0];
open(FILEHANDLE, "<", $inFile) || die("Could not open file ".$inFile);
my @fileLines = <FILEHANDLE>;
my $lineNo = 0;
my $rslt;
foreach my $line (@fileLines) {
chomp($line);
$line =~ s/^\s+//;
$line =~ s/\s+$//;
$lineNo++;
print "\n".$lineNo." = <".$line.">";
if($line =~ m/^\|-+(.+)/) {
my $text = $1;
print "\n\ttext = <".$text.">";
my $minCnt = length($line) - length($text) - 1;
print "\n\tminus count = <".$minCnt.">";
}
}
close(FILEHANDLE);
