perl regular expressions replacement - regex

I haven't been able to figure out how to deal with a specific regex problem.
Say I have the a big string that consists of lots of phrases in square brackets. A phrase label (eg S or VP), a token (eg w or wSf), a slash next to that token and then the token's description, (eg CC or VBD_MS3).
So here's an example string:
[S w#/CC] [VP mSf/VBD_MS3]
I want to delete the whole first bracketed phrase and put the w inside of it with the second phrase, like this:
[VP wmSf/VBD_MS3]
Is that even possible using regular expressions?
Edit:
Okay the pattern is:
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
(the second bracketed phrase could have one to any number of / pairs)
where can be any sequence of capital letters that might include an underscore, and word can a sequence of anything that's not whitespace (ie digits/characters/special characters).

Without knowing the actual form or positions, one of these forms might work (untested):
s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S/VP) (\w+)#/\w+\] (\[(?:S/VP) )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S/VP)\s+(\w+)#/\w+\]\s+(\[(?:S/VP)\s+)(\w+/\w+\])}{$2$1$3}g
Edit
Since your edit has included this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it makes it easier to come up with a regex that should work.
Good luck!
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my $regex = qr{
\[\s* #= Start of token phrase '['
(?&label) \s+ # <label> then whitespace's
((?&word)) # Capture $1 - token word, end grp $1
[#]/(?&label) # '#'/<label>
\s*
\] #= End of token phrase ']'
\s*
( # Capture grp $2
\[\s* #= Start of normal phrase '['
(?&label) \s+ # <label> then whitespace's
) # End grp $2
( # Capture grp $3
(?&word)/(?&label) # First <word>/<label> pair
(?:
\s+(?&word)/(?&label) # Optional, many <word>/<label> pair's
)*
\s*
\] #= End of normal phrase ']'
) # End grp $3
(?(DEFINE) ## DEFINE's:
(?<label> \w+) # <label> - 1 or more word characters
(?<word> [^\s\[\]]+ ) # <word> - 1 or more NOT whitespace, '[' nor ']'
)
}x;
$data =~ s/$regex/$2$1$3/g;
print $data;
__DATA__
[S w#/CC] [VP mSf/VBD_MS3]
Output:
[VP wmSf/VBD_MS3]
Edit2
"if the label of the character is PP, and if the next phrase's label is NP, then change the next phrase's label to PP as well when joining. eg. input: [PP w#/IN] [NP something/NN] output: [PP wsomething/NN]"
Sure, without adding too many new capture groups, it can be done with a callback.
Actually, there are many ways to do this, including regex conditionals. I think the
simplest method is with a callback, where the logic for all label decisions can be made.
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my $regex = qr{
( \[\s* # 1 - Token phrase label
(?&label)
\s+
)
( # 2 - Token word
(?&word)
)
[#]/(?&label)
\s*
\]
\s*
( \[\s* # 3 - Normal phrase label
(?&label)
\s+
)
# insert token word ($2) here
( # 4 - The rest ..
(?&word)/(?&label)
(?: \s+ (?&word)/(?&label) )*
\s*
\]
)
(?(DEFINE) ## DEFINE's:
(?<label> \w+) # <label> - 1 or more word characters
(?<word> [^\s\[\]]+ ) # <word> - 1 or more NOT whitespace, '[' nor ']'
)
}x;
$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;
sub checkLabel
{
my ($p1, $p2) = #_;
if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
return $1.'PP'.$2;
# To use the formatting of the token label, just 'return $p1;'
}
return $p2;
}
print $data;
__DATA__
[PP w#/CC] [ NP mSf/VBD_MS3]

Yes,
s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;
Now what patterns are you looking for?
You could even do this:
s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;

Rather than create a magic regex to do the whole job, why not separate the line into phrases, operate on them then return them. This then follows the same logic that you just explained.
This then cleaner, more readable (especially if you add comments) and robust. Of course you will need to tailor to your needs: for example you may want to make the / separated portions into key/value pairs (does the order matter? if not make a hashref); perhaps you don't need to split on / if you never need to modify the label; etc.
Edit per comments:
This takes a literal w before a #, stores it, removes the phrase, then tacks the w onto the next phrase. If thats what you need then have at it. Of course I'm sure there are edge cases to look out for, so backup and test first!
#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
#separate phrases, then split phases into whitespace separated pieces
my #phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);
my $holder; # holder for 'w' (not really needed if always 'w')
foreach my $p (#phrases) { # for each phrase
if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
$holder = $1; # keep the 'w' in holder
$p = undef; #empty to mark for cleaning later
next; #move to next phrase
}
if ($holder) { #if the holder is not empty
$p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
$holder = undef; # and then empty the holder
}
}
#remove emptied phrases
#phrases = grep { $_ } #phrases;
#reconstitute the line
print join( ' ', map { '[' . join(' ', #$_) . ']' } #phrases), "\n";
}
__DATA__
[S w#/CC] [VP mSf/VBD_MS3]
Again, it may seem amazing what you can do with one regex, but what happens if your boss comes in and says, "you know, that thing you wrote to do X works great, but now it needs to do Y too". This is why I like to keep nicely separate logic for each logical step.

#/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;

Related

Using regex to extract a matching pattern from a string and assign it to a variable using perl

I am seeking advice on extracting a section of a string, that is always occurs as the first instance data between parenthesis using perl and regex and assign that value to a variable.
Here is the precise situation, I am using perl and regex to extract the courseID from a university catalog and assign it to a variable. Please consider the following:
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The typical format is [course-section-name] [(courseID)] [courseName]
My goal is to create a script which can take each entry, one at a time, assign it to a variable and then use regex to extract only the courseID and assign only the courseID to a variable.
My approach has been to use search and replace to replace everything not matching that with '' and then saving what is left (the courseID) to the variable. Here are a few examples of what I have tried the following:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\d\d\d\d\d]//g;
print $courseID;
Result: 21090112345 --- printing the course-section-name and courseID
$string = "BIO-2109-01 (12345) Introduction to Biology";
$($courseID = $string) =~ s/[^\b\(\d{5}\)]\b//g;
print $courseID;
Result: 210901(12345) --- printing course-section-name, parens, and courseID
So I haven't had much luck with search and replace - however I found this nugget:
\(([^\)]+)\)
On http://regexr.com/ that will match the parens section. However, it would also match multiple parans, including for example (abc).
I'm not really sure at this point how to do something like this:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 12345
OR, better:
$string = IDS-3250-01 (98765) History of US (1860-2000)
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 98765
Any advice or direction would be greatly appreciated. I have tried everything I know and can research in regards to regex to solve this problem. If there is anymore information I can include please ask away.
UPDATE
use warnings 'all';
use strict;
use feature 'say';
my $file = './data/enrollment.csv'; #File this script generates
my $course = ""; #Complete course string [name-of-course] [(courseID)] [course_name]
my #arrayCourses = ""; #Array of courseIDs
my $i = ""; #i in for loop
my $courseID = ""; #Extracted course ID
my $userName = ""; #Username of person we are enrolling
my $action = "add,"; #What we are doing to user
my $permission = "teacher,"; #What permissions to assign to user
my $stringToPrint = ""; #Concatinated string to write to file
my $n = "\n"; #\n
my $c = ","; #,
#BEGIN PROGRAM
print "Enter the username \n";
chomp($userName = <STDIN>); #Get the enrollee username from user
print "\n";
print "Enter course name and press enter. Enter 'x' to end. \n"; #prompt for course names
while ($course ne 'x') {
chomp($course = <STDIN>);
if ($course ne "x") {
if (($courseID) = ($course =~ /[^(]+\(([^)]+)\)/) ) { #nasty regex to extract courseID - thnx PerlDuck and zdim
push #arrayCourses, $courseID; #put the courseID into array
}
else {
print "Cannot process last entry check it";
}
}
else {
last;
}
}
shift #arrayCourses; #Remove first entry from array - add,teacher,,username
open(my $fh,'>', $file); #open file
for $i (#arrayCourses) #write array to file
{
$stringToPrint= join "", $action, $permission, $i, $c, $userName, $n ;
print $fh $stringToPrint;
}
close $fh;
That'll do it! Suggestions or improvements are always welcome! Thanks #PerlDuck and #zdim
#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ /[^(]+\(([^)]+)\)/) ) {
print "course-ID = $courseID; -- line was $line";
}
}
__DATA__
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
Output:
course-ID = 12345; -- line was BIO-2109-01 (12345) Introduction to Biology
course-ID = 54321; -- line was CHM-3501-F2-01 (54321) Introduction to Chemistry
course-ID = 98765; -- line was IDS-3250-01 (98765) History of US (1860-2000)
course-ID = 45678; -- line was SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The pattern I used, /[^(]+\(([^)]+)\)/, can also be written as
/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '('. You must escape that because you don't want
# to start it a capture group.
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x
The /x modifier allows you to insert spaces, comments, and even newlines right in the pattern.
Just in case you're unsure about the /x. You can indeed write:
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ / [^(]+ # …
\( # …
([^)]+) # …
\) # …
/x ) ) {
print "course-ID = $courseID; -- line was $line";
}
}
That's probably not nice to read but you can also store the regex in a separate variable:
my $pattern =
qr/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '(' (you must escape it)
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x;
And then:
if (my ($courseID) = ($line =~ $pattern)) {
…
}
Since you nailed down the format
my ($section, $id, $name) =
$string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;
The key here is the negated character class, [^...], which matches any one character other than those listed inside following the ^ (which makes it "negated"). The un-escaped parenthesis capture the match, except inside a character class [] where they are taken as literal.
It first matches all consecutive characters other than (, so up to first (, what is captured by the pair of ( ) around it. Then all other than ), so up to the first closing paren, also captured by its own pair ( ). This comes between literal parenthesis \( ... \), which are outside of ( ) since we don't want them captured. Then all the rest is captured, (.+), requiring at least some characters since + means one or more. Note though that these can be spaces. We exclude possible leading white space from the first capture, by matching it specifically before the capturing parenthesis, and extract (some of) possible spaces around id-parenthesis.
The /x modifier allows use of spaces (and comments and newlines) inside, what helps reaadbility. The match operator returns a list of all matches, which we assign to variables. Note, even if there is only one match it still returns (it as) a list. See Regular Expressions Tutorial (perlretut).
Then, assuming that you have the catalog in a file
use warnings 'all';
use strict;
use feature 'say';
my $file = 'catalog.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
next if $line =~ /^\s*$/; # skip empty lines
# Strip leading and trailing white space
$line =~ s{^\s*|\s*$}{}g;
my ($section, $id, $name) =
$line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
or do {
warn "Error with expected format -- ";
next;
};
say "$section, $id, $name";
}
close $fh;
I use s{}{} delimiters since s/// confuse markup's syntax highlighter with this pattern, which is also a good demonstration since these sometimes help readability a lot.
You would store the retrieved variables in a suitable data structure. Any combination of arrays and hashes (and their references) comes to mind, depending on what need be done with them later. See Cookbook of Data Structures (perldsc).
Note on the error handling. Since none of the matches involve * (allowing zero matches -- nothing), if any component of your format isn't as expected there won't be a match at all and we get an error. The .+ is extremely permissive but it still requires something to be there. This is why the trailing space is first stripped, so that the last pattern (.+) cannot be satisfied by spaces alone.
If the only objective is the course id and we are certain that the first parenthesis are around it
my ($id) = $line =~ / \(\s* ([^)]+) \) /x or do { ... };
We now only need to match and capture the middle piece, something inside parenthesis.

Regular Expression - Perl: Adding Specificity

I wanted to write a specific regex to do the following in a specific file format.
It should be able to check with a regular expression whether the third field is just an O or has anything following the O.
Currently, I use the following syntax as shown below:
if ($line !~ /^ATOM\s+\d+\s+(O)/)
{
}
Could you guys help me out?
ATOM 284 OD1 ASN 1 34 -7.92000 -6.74600 -4.73800 O_2 1 2 -0.55000 0 0
ATOM 308 O LEU 1 35 -10.48500 -13.59200 -8.35100 O_2 1 2 -0.51000 0 0
I want to be able to print out the lines from a file that contain something after the O. (Such as the OD1 line). I should be able to remove the lines with just an O.
Just add \S (meaning "a non-space character"):
/^ATOM\s+\d+\s+O\S/
Incidentally, I get the impression that you don't actually know regular expressions? I recommend the perlretut ("Perl regular expressions tutorial") manpage.
You currently use !~ for does not match. If you want it to match you'll have to change it to =~. You also don't need the parenthesis are the O. () are used for capture groups. If you wanted to capture the group, you could do (O[A-Za-z0-9]).
if ($line =~ /^ATOM\s+\d+\s+O/)
# we don't care what's after the O, could be nothing or some characters
or
if ($line =~ /^ATOM\s+\d+\s+(O[a-zA-Z0-9]*)/)
# this will capture OD1 or just O in $1
or if you want to see whether there are characters after the 0, you can use
if ($line =~ /^ATOM\s+\d+\s+(O[a-zA-Z0-9]+)/)
# this would only capture OD1 in $1
You can use split to split out that one field:
my $field = ( split /\s+/, $line )[2];
This will make the regular expression easier you want easier to do. Plus, it makes what you're doing more obvious:
if ( $field =~ /^O/ ) {
here be dragons...
}
In fact, you might want to do that for all of your fields to make it easier to manipulate. Since I don't know what your fields mean, I'm just calling them $fld1, $fld2, etc.
my ( $fld1, $fld2, $fld3, $fld4, ... ) = split /\s+/, $line;
if ( $fld3 =~ /^O/ ) {
here be dragons...
}
Now, you can easily refer to your individual fields in your program.
if you're prefer not to use regex you can use split as #David-W said
my #fields = split /\s+/, $line;
##now $field[2]
if ($fields[2] ne 'o'){
##this line has o and other letters
}
but this will be much slower than regex especially for large data files
as for regex your data file starts with white space (apperantly)
so your regex should be as the following
if ($line !~ /^\s+ATOM\s+\d+\s+(O)\s+/){
##this line has o with other letters beside it
} else {
## this line only has o in field 3
}
adding ^\s+ at the beginning or remove ^ mark totally
$line !~ /ATOM\s+\d+\s+(O)\s+/
then add \s+ after o (at the end) to make sure it followed by space immediately
if you're not interested in capturing fields value you better off capture group (o)
if ($line !~ /ATOM\s+\d+\s+O\s+/) {
#...
} else {
#...
}

Perl Regex Multiple Matches

I'm looking for a regular expression that will behave as follows:
input: "hello world."
output: he, el, ll, lo, wo, or, rl, ld
my idea was something along the lines of
while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
print "$1-$2 ";
}
But that does something a little bit different.
It's tricky. You have to capture it, save it, and then force a backtrack.
You can do that this way:
use v5.10; # first release with backtracking control verbs
my $string = "hello, world!";
my #saved;
my $pat = qr{
( \pL {2} )
(?{ push #saved, $^N })
(*FAIL)
}x;
#saved = ();
$string =~ $pat;
my $count = #saved;
printf "Found %d matches: %s.\n", $count, join(", " => #saved);
produces this:
Found 8 matches: he, el, ll, lo, wo, or, rl, ld.
If you do not have v5.10, or you have a headache, you can use this:
my $string = "hello, world!";
my #pairs = $string =~ m{
# we can only match at positions where the
# following sneak-ahead assertion is true:
(?= # zero-width look ahead
( # begin stealth capture
\pL {2} # save off two letters
) # end stealth capture
)
# succeed after matching nothing, force reset
}xg;
my $count = #pairs;
printf "Found %d matches: %s.\n", $count, join(", " => #pairs);
That produces the same output as before.
But you might still have a headache.
No need "to force backtracking"!
push #pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;
Though you might want to match any letter rather than the limited set you specified.
push #pairs, "$1$2" while /(\pL)(?=(\pL))/g;
Yet another way to do it. Doesn't use any regexp magic, it does use nested maps but this could easily be translated to for loops if desired.
#!/usr/bin/env perl
use strict;
use warnings;
my $in = "hello world.";
my #words = $in =~ /(\b\pL+\b)/g;
my #out = map {
my #chars = split '';
map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
} #words;
print join ',', #out;
print "\n";
Again, for me this is more readable than a strange regex, YMMV.
I would use captured group in lookahead..
(?=([a-zA-Z]{2}))
------------
|->group 1 captures two English letters
try it here
You can do this by looking for letters and using the pos function to make use of the position of the capture, \G to reference it in another regex, and substr to read a few characters from the string.
use v5.10;
use strict;
use warnings;
my $letter_re = qr/[a-zA-Z]/;
my $string = "hello world.";
while( $string =~ m{ ($letter_re) }gx ) {
# Skip it if the next character isn't a letter
# \G will match where the last m//g left off.
# It's pos() in a regex.
next unless $string =~ /\G $letter_re /x;
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
You can put the "check the next letter" logic inside the original regex with a zero-width positive assertion, (?=pattern). Zero-width meaning it is not captured and does not advance the position of a m//g regex. This is a bit more compact, but zero-width assertions get can get tricky.
while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
UPDATE: I'd originally tried to capture both the match and the look ahead as m{ ($letter_re (?=$letter_re)) }gx but that didn't work. The look ahead is zero-width and slips out of the match. Other's answers showed that if you put a second capture inside the look-ahead then it can collapse to just...
say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
I leave all the answers here for TMTOWTDI, especially if you're not a regex master.

Counting Matches of Strings As Well as Determine Which Sentence Matches Can Be Found In

I'm doing this in Perl.
I have a text file that contains several paragraphs and 61 sentences.
First, I need to match a series of words that are input on the command line, which I have no trouble at all doing:
my $input = $ARGV[0];
$file =~ m/$input/gi;
Unfortunately, there are some wrinkles-
1. The input can be for multiple items and
2. The multiple items can be on different lines.
I will show you an example:
3 sentences match the pattern "fall|election|2009". The sentences are:
4: "We hate elections."
16: "The dog was injured in a fall from the balcony."
24: "There will be no 2009 fall election."
In this case, the program found counted three sentences within the document that contained either fall, election or 2009, where fall|election|2009 was the input.
My question is twofold:
How do I count the number of sentences that the inputs appear in? I'm very unexperienced with regex, but I would have thought that the default match would try to match the first occurrence of either fall, election, or 2009 that occurred within the file and neither count how many instances there were of each individual word and then add them up. I'm kind of hung up on this, as I don't understand counting with regex at all.
The second part of my question relates to how to first find which sentence the input is found in (i.e. elections appearing in line 4) and how to extract the whole sentence that the input is located in. I think this would be done using first an if: if there is a match within the string to the input then a new scalar equals the text file =~ a substitution? of the sentence... I'm totally unsure.
Edit: I actually have a fully parsed HTML document that I am performing this on. If printed, the output of one example is:
"The Journal is now on Facebook! Check out our page here. It's a work in progress, and we're hungry for your feedback. So let us know what you think on our discussion board, comment below or send us an email. Get breaking news, insider information and curiosities by following The Journal on Twitter. Here are some feeds and writers you might want to follow:"
My command line looks like this: perl WebScan.pl information|writers WebPage000.htm
I have, as aforementioned parsed through the webpage and removed all tags, leaving just text. Now, I have to find the input, which in this case is "information" or "writers". I have to find out how many times these occur within the text of the file (so 2), as well as in which sentence they appear in (so 5 and 6 respectively). I will show you my code so far:
use strict;
use warnings;
my $file;
open (FILENAME, $ARGV[1]);
$file = do { local $/; <FILENAME> };
$file =~ s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
$file =~ s/&nbsp//gi;
$file =~ s/&#160//gi;
$file =~ s/;//gi;
$file =~ s/[\h\v]+/ /g;
my $count = $file =~ s/((^|\s)\S)/$2/g;
my $sentencecount = $file =~ s/((^|\s)\S).*?(\.|\?|\!)/$1/g;
print "Input file $ARGV[1] contains $sentencecount sentences and $count words.";
So, I need perl to, using $ARGV[0] as keywords, search through a text file, counting the number of times the keyword appears. Then, I need to state what sentence the keyword appeared in (i.e. print the whole sentence in full), as well as the number that the sentence is in.
It's not clear if you have your sentences delimited (or if you have some criteria for split them). If so, and if understand your problem right, you can do something as this:
#words = qw/hi bye 2009 a*d/;
#lines = ('Lets see , hi ',
' hi hi hi ',
' asdadasdas ',
'a2009a',
'hi bye');
$pattern="";
foreach $word (#words) {
$pattern .= quotemeta($word) . '|';
}
chop $pattern; # chop last |
print "pattern='$pattern'\n";
$cont = 0;
foreach $line (#lines) {
$cont++ if $line =~ /$pattern/o;
}
printf "$cont/%d lines matched\n",scalar(#lines);
I build the pattern with quotemeta escaping just in case there are some special characters in the words (as in my example, we dont want it to match).
Edit to match updated question
Okay, let me start with a truism: don't try to parse HTML by yourself. HTML::TreeBuilder is your friend.
For regular expressions, the perlfaq6 is a great source of knowledge.
The following sample works with the following syntax: perl WebScan.pl --regex="information|writers" --filename=WebPage000.htm.
It will print a list of paragraphs and their matches.
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
use Data::Dumper;
use Getopt::Long;
my #regexes;
my $filename;
GetOptions('regex=s' => \#regexes, 'filename=s' => \$filename);
my $tb = HTML::TreeBuilder->new_from_file($filename);
$tb->normalize_content;
my #patterns = map { qr/$_/ } #regexes;
my #all;
foreach my $node ($tb->find_by_tag_name('p', 'pre', 'blockquote')) {
my $text = $node->as_text;
my #matches;
foreach my $r (#patterns) {
while ($text =~ /$r/gi) {
push #matches, $&;
}
}
push #all, { paragraph => $text, matches => \#matches } if #matches;
}
foreach (#all) {
print "Paragraph:\n\t$_->{paragraph}\nMatches:\n\t", join(', ', #{$_->{matches}}), "\n";
}
Hopefully, this can point you in the right direction.

In Perl, how can I get the matched substring from a regex?

My program read other programs source code and colect information about used SQL queries. I have problem with getting substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
In result print prints whole line and not a substring I expect. I tried different approach, but I use Perl seldom and probably make basic concept error. ( position of tablename in line is not fixed. Another problem is multiple occurrence i.e.[... SELECT * FROM AADTTAB, BBDTTAB, ...] ). How can I obtain that substring?
Use grouping with parenthesis and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my #tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(#tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This psuedo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, #rest ) = #_;
my ( $first_capture, $second_capture, #others ) = $feldman =~ /$some_pattern/;
NOTE:: That said, your regex assumes too much about the text to be useful in more than a handful of situations. Not capturing any table name that doesn't have dt as in positions 3 and 4 out of 7? It's good enough for 1) quick-and-dirty, 2) if you're okay with limited applicability.
It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.
Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}
Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;
$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
If you use $& in your code it will slow down all pattern matches.