Regular Expression - Perl: Adding Specificity - regex

I wanted to write a specific regex to do the following in a specific file format.
It should be able to check with a regular expression whether the third field is just an O or has anything following the O.
Currently, I use the following syntax as shown below:
if ($line !~ /^ATOM\s+\d+\s+(O)/)
{
}
Could you guys help me out?
ATOM 284 OD1 ASN 1 34 -7.92000 -6.74600 -4.73800 O_2 1 2 -0.55000 0 0
ATOM 308 O LEU 1 35 -10.48500 -13.59200 -8.35100 O_2 1 2 -0.51000 0 0
I want to be able to print out the lines from a file that contain something after the O. (Such as the OD1 line). I should be able to remove the lines with just an O.

Just add \S (meaning "a non-space character"):
/^ATOM\s+\d+\s+O\S/
Incidentally, I get the impression that you don't actually know regular expressions? I recommend the perlretut ("Perl regular expressions tutorial") manpage.

You currently use !~ for does not match. If you want it to match you'll have to change it to =~. You also don't need the parenthesis are the O. () are used for capture groups. If you wanted to capture the group, you could do (O[A-Za-z0-9]).
if ($line =~ /^ATOM\s+\d+\s+O/)
# we don't care what's after the O, could be nothing or some characters
or
if ($line =~ /^ATOM\s+\d+\s+(O[a-zA-Z0-9]*)/)
# this will capture OD1 or just O in $1
or if you want to see whether there are characters after the 0, you can use
if ($line =~ /^ATOM\s+\d+\s+(O[a-zA-Z0-9]+)/)
# this would only capture OD1 in $1

You can use split to split out that one field:
my $field = ( split /\s+/, $line )[2];
This will make the regular expression easier you want easier to do. Plus, it makes what you're doing more obvious:
if ( $field =~ /^O/ ) {
here be dragons...
}
In fact, you might want to do that for all of your fields to make it easier to manipulate. Since I don't know what your fields mean, I'm just calling them $fld1, $fld2, etc.
my ( $fld1, $fld2, $fld3, $fld4, ... ) = split /\s+/, $line;
if ( $fld3 =~ /^O/ ) {
here be dragons...
}
Now, you can easily refer to your individual fields in your program.

if you're prefer not to use regex you can use split as #David-W said
my #fields = split /\s+/, $line;
##now $field[2]
if ($fields[2] ne 'o'){
##this line has o and other letters
}
but this will be much slower than regex especially for large data files
as for regex your data file starts with white space (apperantly)
so your regex should be as the following
if ($line !~ /^\s+ATOM\s+\d+\s+(O)\s+/){
##this line has o with other letters beside it
} else {
## this line only has o in field 3
}
adding ^\s+ at the beginning or remove ^ mark totally
$line !~ /ATOM\s+\d+\s+(O)\s+/
then add \s+ after o (at the end) to make sure it followed by space immediately
if you're not interested in capturing fields value you better off capture group (o)
if ($line !~ /ATOM\s+\d+\s+O\s+/) {
#...
} else {
#...
}

Related

Using regex to extract a matching pattern from a string and assign it to a variable using perl

I am seeking advice on extracting a section of a string, that is always occurs as the first instance data between parenthesis using perl and regex and assign that value to a variable.
Here is the precise situation, I am using perl and regex to extract the courseID from a university catalog and assign it to a variable. Please consider the following:
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The typical format is [course-section-name] [(courseID)] [courseName]
My goal is to create a script which can take each entry, one at a time, assign it to a variable and then use regex to extract only the courseID and assign only the courseID to a variable.
My approach has been to use search and replace to replace everything not matching that with '' and then saving what is left (the courseID) to the variable. Here are a few examples of what I have tried the following:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\d\d\d\d\d]//g;
print $courseID;
Result: 21090112345 --- printing the course-section-name and courseID
$string = "BIO-2109-01 (12345) Introduction to Biology";
$($courseID = $string) =~ s/[^\b\(\d{5}\)]\b//g;
print $courseID;
Result: 210901(12345) --- printing course-section-name, parens, and courseID
So I haven't had much luck with search and replace - however I found this nugget:
\(([^\)]+)\)
On http://regexr.com/ that will match the parens section. However, it would also match multiple parans, including for example (abc).
I'm not really sure at this point how to do something like this:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 12345
OR, better:
$string = IDS-3250-01 (98765) History of US (1860-2000)
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 98765
Any advice or direction would be greatly appreciated. I have tried everything I know and can research in regards to regex to solve this problem. If there is anymore information I can include please ask away.
UPDATE
use warnings 'all';
use strict;
use feature 'say';
my $file = './data/enrollment.csv'; #File this script generates
my $course = ""; #Complete course string [name-of-course] [(courseID)] [course_name]
my #arrayCourses = ""; #Array of courseIDs
my $i = ""; #i in for loop
my $courseID = ""; #Extracted course ID
my $userName = ""; #Username of person we are enrolling
my $action = "add,"; #What we are doing to user
my $permission = "teacher,"; #What permissions to assign to user
my $stringToPrint = ""; #Concatinated string to write to file
my $n = "\n"; #\n
my $c = ","; #,
#BEGIN PROGRAM
print "Enter the username \n";
chomp($userName = <STDIN>); #Get the enrollee username from user
print "\n";
print "Enter course name and press enter. Enter 'x' to end. \n"; #prompt for course names
while ($course ne 'x') {
chomp($course = <STDIN>);
if ($course ne "x") {
if (($courseID) = ($course =~ /[^(]+\(([^)]+)\)/) ) { #nasty regex to extract courseID - thnx PerlDuck and zdim
push #arrayCourses, $courseID; #put the courseID into array
}
else {
print "Cannot process last entry check it";
}
}
else {
last;
}
}
shift #arrayCourses; #Remove first entry from array - add,teacher,,username
open(my $fh,'>', $file); #open file
for $i (#arrayCourses) #write array to file
{
$stringToPrint= join "", $action, $permission, $i, $c, $userName, $n ;
print $fh $stringToPrint;
}
close $fh;
That'll do it! Suggestions or improvements are always welcome! Thanks #PerlDuck and #zdim
#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ /[^(]+\(([^)]+)\)/) ) {
print "course-ID = $courseID; -- line was $line";
}
}
__DATA__
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
Output:
course-ID = 12345; -- line was BIO-2109-01 (12345) Introduction to Biology
course-ID = 54321; -- line was CHM-3501-F2-01 (54321) Introduction to Chemistry
course-ID = 98765; -- line was IDS-3250-01 (98765) History of US (1860-2000)
course-ID = 45678; -- line was SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The pattern I used, /[^(]+\(([^)]+)\)/, can also be written as
/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '('. You must escape that because you don't want
# to start it a capture group.
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x
The /x modifier allows you to insert spaces, comments, and even newlines right in the pattern.
Just in case you're unsure about the /x. You can indeed write:
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ / [^(]+ # …
\( # …
([^)]+) # …
\) # …
/x ) ) {
print "course-ID = $courseID; -- line was $line";
}
}
That's probably not nice to read but you can also store the regex in a separate variable:
my $pattern =
qr/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '(' (you must escape it)
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x;
And then:
if (my ($courseID) = ($line =~ $pattern)) {
…
}
Since you nailed down the format
my ($section, $id, $name) =
$string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;
The key here is the negated character class, [^...], which matches any one character other than those listed inside following the ^ (which makes it "negated"). The un-escaped parenthesis capture the match, except inside a character class [] where they are taken as literal.
It first matches all consecutive characters other than (, so up to first (, what is captured by the pair of ( ) around it. Then all other than ), so up to the first closing paren, also captured by its own pair ( ). This comes between literal parenthesis \( ... \), which are outside of ( ) since we don't want them captured. Then all the rest is captured, (.+), requiring at least some characters since + means one or more. Note though that these can be spaces. We exclude possible leading white space from the first capture, by matching it specifically before the capturing parenthesis, and extract (some of) possible spaces around id-parenthesis.
The /x modifier allows use of spaces (and comments and newlines) inside, what helps reaadbility. The match operator returns a list of all matches, which we assign to variables. Note, even if there is only one match it still returns (it as) a list. See Regular Expressions Tutorial (perlretut).
Then, assuming that you have the catalog in a file
use warnings 'all';
use strict;
use feature 'say';
my $file = 'catalog.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
next if $line =~ /^\s*$/; # skip empty lines
# Strip leading and trailing white space
$line =~ s{^\s*|\s*$}{}g;
my ($section, $id, $name) =
$line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
or do {
warn "Error with expected format -- ";
next;
};
say "$section, $id, $name";
}
close $fh;
I use s{}{} delimiters since s/// confuse markup's syntax highlighter with this pattern, which is also a good demonstration since these sometimes help readability a lot.
You would store the retrieved variables in a suitable data structure. Any combination of arrays and hashes (and their references) comes to mind, depending on what need be done with them later. See Cookbook of Data Structures (perldsc).
Note on the error handling. Since none of the matches involve * (allowing zero matches -- nothing), if any component of your format isn't as expected there won't be a match at all and we get an error. The .+ is extremely permissive but it still requires something to be there. This is why the trailing space is first stripped, so that the last pattern (.+) cannot be satisfied by spaces alone.
If the only objective is the course id and we are certain that the first parenthesis are around it
my ($id) = $line =~ / \(\s* ([^)]+) \) /x or do { ... };
We now only need to match and capture the middle piece, something inside parenthesis.

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my #matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my #text;
if ($text =~ /,/) {
#text = split /,(?:null_)?/, $text;
}
else {
#text = $text;
}
for (#text) {
print "1 is : $_\n";
push (#matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)

Substitute first character before match

For each line I need to add a semicolon exactly one character before the first match of an alphanumeric sign but only for the alphanumeric sign after the first appearance of a semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone helps me please to create Perl regex for that? I'd need it in a program, not as a oneliner.
This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
First of all, here is a program that seems to match your requirements:
#/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line, skips over numbers (\d*) followed by the first semicolon and space. Before the following word character, it inserts a semicolon.
Take what fits best to your needs!
First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator=";" # at the beginning of the file
#...
sub insert {
my ( $seperator, $line, #all_lines, $count, #all_out );
$count = 0;
#all_lines = read_file($filename);
foreach $line (#all_lines) {
$count = sprintf( "%08d", $count );
chomp $line;
$line =~ s/\:/$seperator/; # works
$line =~ s/\ file/file/; # works
#$line=~s/;\s*\K(?=\S)/;/; # doesn't work
$line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
say $count . $seperator . $line . $seperator;
$count++; # btw, is there maybe a hidden index variable in a foreach-loop I could us instead of a new variable??
push( #all_out, $count . $seperator . $line . $seperator . "\n" );
}
write_file( $csvfile, #all_out ); # using File::Slurp
}
In order to get the input which I presented you, I made already some small substitutions, as you can see in the beginning of the foreach-loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}

Counting Matches of Strings As Well as Determine Which Sentence Matches Can Be Found In

I'm doing this in Perl.
I have a text file that contains several paragraphs and 61 sentences.
First, I need to match a series of words that are input on the command line, which I have no trouble at all doing:
my $input = $ARGV[0];
$file =~ m/$input/gi;
Unfortunately, there are some wrinkles-
1. The input can be for multiple items and
2. The multiple items can be on different lines.
I will show you an example:
3 sentences match the pattern "fall|election|2009". The sentences are:
4: "We hate elections."
16: "The dog was injured in a fall from the balcony."
24: "There will be no 2009 fall election."
In this case, the program found counted three sentences within the document that contained either fall, election or 2009, where fall|election|2009 was the input.
My question is twofold:
How do I count the number of sentences that the inputs appear in? I'm very unexperienced with regex, but I would have thought that the default match would try to match the first occurrence of either fall, election, or 2009 that occurred within the file and neither count how many instances there were of each individual word and then add them up. I'm kind of hung up on this, as I don't understand counting with regex at all.
The second part of my question relates to how to first find which sentence the input is found in (i.e. elections appearing in line 4) and how to extract the whole sentence that the input is located in. I think this would be done using first an if: if there is a match within the string to the input then a new scalar equals the text file =~ a substitution? of the sentence... I'm totally unsure.
Edit: I actually have a fully parsed HTML document that I am performing this on. If printed, the output of one example is:
"The Journal is now on Facebook! Check out our page here. It's a work in progress, and we're hungry for your feedback. So let us know what you think on our discussion board, comment below or send us an email. Get breaking news, insider information and curiosities by following The Journal on Twitter. Here are some feeds and writers you might want to follow:"
My command line looks like this: perl WebScan.pl information|writers WebPage000.htm
I have, as aforementioned parsed through the webpage and removed all tags, leaving just text. Now, I have to find the input, which in this case is "information" or "writers". I have to find out how many times these occur within the text of the file (so 2), as well as in which sentence they appear in (so 5 and 6 respectively). I will show you my code so far:
use strict;
use warnings;
my $file;
open (FILENAME, $ARGV[1]);
$file = do { local $/; <FILENAME> };
$file =~ s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
$file =~ s/&nbsp//gi;
$file =~ s/&#160//gi;
$file =~ s/;//gi;
$file =~ s/[\h\v]+/ /g;
my $count = $file =~ s/((^|\s)\S)/$2/g;
my $sentencecount = $file =~ s/((^|\s)\S).*?(\.|\?|\!)/$1/g;
print "Input file $ARGV[1] contains $sentencecount sentences and $count words.";
So, I need perl to, using $ARGV[0] as keywords, search through a text file, counting the number of times the keyword appears. Then, I need to state what sentence the keyword appeared in (i.e. print the whole sentence in full), as well as the number that the sentence is in.
It's not clear if you have your sentences delimited (or if you have some criteria for split them). If so, and if understand your problem right, you can do something as this:
#words = qw/hi bye 2009 a*d/;
#lines = ('Lets see , hi ',
' hi hi hi ',
' asdadasdas ',
'a2009a',
'hi bye');
$pattern="";
foreach $word (#words) {
$pattern .= quotemeta($word) . '|';
}
chop $pattern; # chop last |
print "pattern='$pattern'\n";
$cont = 0;
foreach $line (#lines) {
$cont++ if $line =~ /$pattern/o;
}
printf "$cont/%d lines matched\n",scalar(#lines);
I build the pattern with quotemeta escaping just in case there are some special characters in the words (as in my example, we dont want it to match).
Edit to match updated question
Okay, let me start with a truism: don't try to parse HTML by yourself. HTML::TreeBuilder is your friend.
For regular expressions, the perlfaq6 is a great source of knowledge.
The following sample works with the following syntax: perl WebScan.pl --regex="information|writers" --filename=WebPage000.htm.
It will print a list of paragraphs and their matches.
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
use Data::Dumper;
use Getopt::Long;
my #regexes;
my $filename;
GetOptions('regex=s' => \#regexes, 'filename=s' => \$filename);
my $tb = HTML::TreeBuilder->new_from_file($filename);
$tb->normalize_content;
my #patterns = map { qr/$_/ } #regexes;
my #all;
foreach my $node ($tb->find_by_tag_name('p', 'pre', 'blockquote')) {
my $text = $node->as_text;
my #matches;
foreach my $r (#patterns) {
while ($text =~ /$r/gi) {
push #matches, $&;
}
}
push #all, { paragraph => $text, matches => \#matches } if #matches;
}
foreach (#all) {
print "Paragraph:\n\t$_->{paragraph}\nMatches:\n\t", join(', ', #{$_->{matches}}), "\n";
}
Hopefully, this can point you in the right direction.

In Perl, how can I get the matched substring from a regex?

My program read other programs source code and colect information about used SQL queries. I have problem with getting substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
In result print prints whole line and not a substring I expect. I tried different approach, but I use Perl seldom and probably make basic concept error. ( position of tablename in line is not fixed. Another problem is multiple occurrence i.e.[... SELECT * FROM AADTTAB, BBDTTAB, ...] ). How can I obtain that substring?
Use grouping with parenthesis and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my #tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(#tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This psuedo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, #rest ) = #_;
my ( $first_capture, $second_capture, #others ) = $feldman =~ /$some_pattern/;
NOTE:: That said, your regex assumes too much about the text to be useful in more than a handful of situations. Not capturing any table name that doesn't have dt as in positions 3 and 4 out of 7? It's good enough for 1) quick-and-dirty, 2) if you're okay with limited applicability.
It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.
Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}
Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;
$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
If you use $& in your code it will slow down all pattern matches.